Preface


Statistics is a required course for undergraduate college students in a number of majors. Students in the 
following disciplines are often required to take a course in beginning statistics: allied health careers, biology, 
business, computer science, criminal justice, decision science, engineering, education, geography, geology, 
information science, nursing, nutrition, medicine, pharmacy, psychology, and public administration. This 
outline is intended to assist these students in understanding statistics. The outline may be used as a 
supplement to textbooks used in these courses or as a text for the course itself. 


The author has taught such courses for over 25 years and understands the difficulty students encounter with 
statistics. I have included examples from a wide variety of current areas of application in order to motivate an 
interest in learning statistics. As we leave the twentieth century and enter the twenty-first century, an 
understanding of statistics is essential in understanding new technology, world affairs, and the ever-expanding 
volume of knowledge. Statistical concepts are encountered in television and radio broadcasting, as well as in 
magazines and newspapers. Modern newspapers, such as USA Today, are full of statistical information. The 
sports section is filled with descriptive statistics concerning player and team performance. The money section 
of USA Today contains descriptive statistics concerning stocks and mutual funds. The life section of USA Today 
often contains summaries of research studies in medicine. An understanding of statistics is helpful in evaluating 
these research summaries. 


The nature of the beginning statistics course has changed drastically in the past 30 or so years. This change 
is due to the technical advances in computing. Prior to the 1960s statistical computing was usually performed 
on mechanical calculators. These were large, cumbersome computing devices (compared to today’s hand-held 
calculators) that performed arithmetic by moving mechanical parts. The computers and software of that era 
bore no comparison to today’s. Statistical packages available today number in 
the hundreds. The burden of statistical computing has been reduced to simply entering your data into a data file 
and then giving the correct command to perform the statistical method of interest. 


One of the most widely used statistical packages in academia as well as industrial settings is the package 
called Minitab (Minitab Inc., 3081 Enterprise Drive, State College, PA 16801-3008). I wish to thank Minitab 
Inc. for granting me permission to include Minitab output, including graphics, throughout the text. Most 
modern Statistics textbooks include computer software as part of the text. I have chosen to include Minitab 
because it is widely used and is very friendly. Once a student learns the various data file structures needed to 
use Minitab, and the structure of the commands and subcommands, this knowledge is readily transferable to 
other statistical software. 


The outline contains all the topics, and more, covered in a beginning statistics course. The only 
mathematical prerequisite needed for the material found in the outline is arithmetic and some basic algebra. I 
wish to thank my wife, Lana, for her understanding during the preparation of the book. I wish to thank my 
friend Stanley Wileman for all the computer help he has given me during the preparation of the book. I wish to 
thank Dr. Edwin C. Hackleman of Delta Software, Inc. for his timely assistance as compositor of the final 
camera-ready manuscript. Finally, I wish to thank the staff at McGraw-Hill for their cooperation and 
helpfulness. 


LARRY J. STEPHENS 
Chapter 1 


Introduction 


STATISTICS 


Statistics is a discipline of study dealing with the collection, analysis, interpretation, and 
presentation of data. Statistical methodology is utilized by pollsters who sample our opinions 
concerning topics ranging from art to zoology. Statistical methodology is also utilized by business 
and industry to help control the quality of goods and services that they produce. Social scientists and 
psychologists use statistical methodology to study our behaviors. Because of its broad range of 
applicability, a course in statistics is required of majors in disciplines such as sociology, psychology, 
criminal justice, nursing, exercise science, pharmacy, education, and many others. To accommodate 
this diverse group of users, examples and problems in this outline are chosen from many different 
sources. 


DESCRIPTIVE STATISTICS 


The use of graphs, charts, and tables and the calculation of various statistical measures to 
organize and summarize information is called descriptive statistics. Descriptive statistics help to 
reduce our information to a manageable size and put it into focus. 


EXAMPLE 1.1 The compilation of batting average, runs batted in, runs scored, and number of home runs for 
each player, as well as earned run average, won/lost percentage, number of saves, etc., for each pitcher from the 
official score sheets for major league baseball players is an example of descriptive statistics. These statistical 
measures allow us to compare players, determine whether a player is having an “off year” or “good year,” etc. 


EXAMPLE 1.2 The publication entitled Crime in the United States published by the Federal Bureau of 
Investigation gives summary information concerning various crimes for the United States. The statistical 
measures given in this publication are also examples of descriptive statistics and they are useful to individuals in 
law enforcement. 


INFERENTIAL STATISTICS: POPULATION AND SAMPLE 


The complete collection of individuals, items, or data under consideration in a statistical study 
is referred to as the population. The portion of the population selected for analysis is called the 
sample. Inferential statistics consists of techniques for reaching conclusions about a population 
based upon information contained in a sample. 


EXAMPLE 1.3 The results of polls are widely reported by both the written and the electronic media. The 
techniques of inferential statistics are widely utilized by pollsters. Table 1.1 gives several examples of 
populations and samples encountered in polls reported by the media. The methods of inferential statistics are 
used to make inferences about the populations based upon the results found in the samples and to give an 
indication about the reliability of these inferences. The results of a poll of 600 registered voters might be 
reported as follows: Forty percent of the voters approve of the president’s economic policies. The margin of 
error for the survey is 4%. The survey indicates that an estimated 40% of all registered voters approve of the 
economic policies, but it might be as low as 36% or as high as 44%. 
Table 1.1 


Population                              Sample

All registered voters                   A telephone survey of 600 registered voters

All owners of handguns                  A telephone survey of 1000 handgun owners

Households headed by a single parent    The results from questionnaires sent to 2500 
                                        households headed by a single parent

The CEOs of all private companies       The results from surveys sent to 150 CEOs of 
                                        private companies
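

The 4% margin of error quoted in Example 1.3 can be roughly reproduced with a short calculation. The sketch below is only illustrative: it assumes a simple random sample and the common large-sample formula 1.96 × sqrt(p(1 - p)/n) at the 95% level, neither of which is stated in the example, and the function name margin_of_error is hypothetical.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and a 95% confidence level (z = 1.96).
    Both are assumptions made for this sketch, not details from Example 1.3.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.40   # 40% of the sampled voters approve
n = 600        # number of registered voters in the sample
moe = margin_of_error(p_hat, n)

# Report the margin of error and the resulting range of plausible values.
print(f"margin of error: {moe:.3f}")
print(f"estimated range: {p_hat - moe:.2f} to {p_hat + moe:.2f}")
```

Under these assumptions the margin of error is a little under 0.04, which agrees with the reported 4% and with the range of 36% to 44%.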


EXAMPLE 1.4 The techniques of inferential statistics are applied in many industrial processes to control the 
quality of the products produced. In industrial settings, the population may consist of the daily production of 
toothbrushes, computer chips, bolts, and so forth. The sample will consist of a random and representative 
selection of items from the process producing the toothbrushes, computer chips, bolts, etc. The information 
contained in the daily samples is used to construct control charts. The control charts are then used to monitor the 
quality of the products. 


EXAMPLE 1.5 The statistical methods of inferential statistics are used to analyze the data collected in 
research studies. Table 1.2 gives the samples and populations for several such studies. The information 
contained in the samples is utilized to make inferences concerning the populations. If it is found that 245 of 350 
or 70% of prison inmates in a criminal justice study were abused as children, what conclusions may be inferred 
concerning the percent of all prison inmates who were abused as children? The answer to this question is 
found in Chapters 8 and 9. 


Table 1.2 


Population                                  Sample

All prison inmates                          A criminal justice study of 350 prison inmates

Legal aliens living in the United States    A sociological study conducted by a university 
                                            researcher of 200 legal aliens

Alzheimer patients in the United States     A medical study of 75 such patients 
                                            conducted by a university hospital

Adult children of alcoholics                A psychological study of 200 such individuals

VARIABLE, OBSERVATION, AND DATA SET 


A characteristic of interest concerning the individual elements of a population or a sample is 
called a variable. A variable is often represented by a letter such as x, y, or z. The value of a variable 
for one particular element from the sample or population is called an observation. A data set consists 
of the observations of a variable for the elements of a sample. 


EXAMPLE 1.6 Six hundred registered voters are polled and each one is asked if they approve or disapprove 
of the president’s economic policies. The variable is the registered voter’s opinion of the president’s economic 
policies. The data set consists of 600 observations. Each observation will be the response “approve” or the 
response “do not approve.” If the response “approve” is coded as the number 1 and the response “do not 
approve” is coded as 0, then the data set will consist of 600 observations, each one of which is either 0 or 1. If x 
is used to represent the variable, then x can assume two values, 0 or 1. 
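

The 0/1 coding in Example 1.6 can be sketched in a few lines. The five responses below are made-up data standing in for the 600 actual observations, and the variable names are only illustrative.

```python
# Hypothetical responses from five of the polled voters (made-up data).
responses = ["approve", "do not approve", "approve", "approve", "do not approve"]

# Code "approve" as 1 and "do not approve" as 0, as described in Example 1.6.
codes = {"approve": 1, "do not approve": 0}
x = [codes[r] for r in responses]

# With this coding, adding the observations counts the voters who approve.
number_approving = sum(x)
proportion_approving = number_approving / len(x)

print(x)
print(number_approving, proportion_approving)
```

One convenience of this particular coding is that the sum of the observations is the number of "approve" responses, so the proportion approving is that sum divided by the number of observations.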
EXAMPLE 1.7 A survey of 2500 households headed by a single parent is conducted and one characteristic of 
interest is the yearly household income. The data set consists of the 2500 yearly household incomes for the 
individuals in the survey. If y is used to represent the variable, then the values for y will be between the smallest 
and the largest yearly household incomes for the 2500 households. 


EXAMPLE 1.8 The number of speeding tickets issued by 75 Nebraska state troopers for the month of June is 
recorded. The data set consists of 75 observations. 


QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE 


A quantitative variable is determined when the description of the characteristic of interest results 
in a numerical value. When a measurement is required to describe the characteristic of interest or it is 
necessary to perform a count to describe the characteristic, a quantitative variable is defined. A 
discrete variable is a quantitative variable whose values are countable. Discrete variables usually 
result from counting. A continuous variable is a quantitative variable that can assume any numerical 
value over an interval or over several intervals. A continuous variable usually results from making a 
measurement of some type. 


EXAMPLE 1.9 Table 1.3 gives several discrete variables and the set of possible values for each one. In each 
case the value of the variable is determined by counting. For a given box of 100 diabetic syringes, the number of 
defective needles is determined by counting how many of the 100 are defective. The number of defectives found 
must equal one of the 101 values listed. The number of possible outcomes is finite for each of the first four 
variables; that is, the number of possible outcomes is 101, 31, 501, and 51, respectively. The number of 
possible outcomes for the last variable is infinite. Since the number of possible outcomes is infinite and 
countable for this variable, we say that the number of outcomes is countably infinite. 


Sometimes it is not clear whether a variable is discrete or continuous. Test scores expressed as a 
percent, for example, are usually given as whole numbers between 0 and 100. It is possible to give a 
score such as 75.57565. However, this is not done in practice because teachers are unable to evaluate 
to this degree of accuracy. This variable is usually regarded as continuous, although for all practical 
purposes, it is discrete. To summarize, due to measurement limitations, many continuous variables 
actually assume only a countable number of values. 


Table 1.3 


Discrete variable                                          Possible values for the variable

The number of defective needles in boxes of 100            0, 1, ..., 100
diabetic syringes

The number of individuals in groups of 30 with a           0, 1, ..., 30
type A personality

The number of surveys returned out of 500 mailed in        0, 1, ..., 500
sociological studies

The number of prison inmates in 50 having finished         0, 1, ..., 50
high school or obtained a GED who are selected for 
criminal justice studies

The number of times you need to flip a coin before a       1, 2, 3, ... 
head appears for the first time                            (there is no upper limit since conceivably 
                                                           one might need to flip forever to obtain the 
                                                           first head)
EXAMPLE 1.10 Table 1.4 gives several continuous variables and the set of possible values for each one. All 
three continuous variables given in Table 1.4 involve measurement, whereas the variables in Example 1.9 all 
involve counting. 


Table 1.4 


Continuous variable                             Possible values for the variable

The length of prison time served for            All the real numbers between a and b, 
individuals convicted of first degree murder    where a is the smallest amount of time 
                                                served and b is the largest amount

The household income for households with        All the real numbers between a and 
incomes less than or equal to $20,000           $20,000, where a is the smallest 
                                                household income in the population

The cholesterol reading for those               All real numbers between 200 and b, 
individuals having cholesterol readings         where b is the largest cholesterol reading 
equal to or greater than 200 mg/dl              of all such individuals


QUALITATIVE VARIABLE 


A qualitative variable is determined when the description of the characteristic of interest results 
in a nonnumerical value. A qualitative variable may be classified into two or more categories. 


EXAMPLE 1.11 Table 1.5 gives several examples of qualitative variables along with a set of categories into 
which they may be classified. 


Table 1.5 


Qualitative variable      Possible categories for the variable

Marital status            Single, married, divorced, separated
Gender                    Male, female
Crime classification      Misdemeanor, felony
Pain level                None, low, moderate, severe
Personality type          Type A, type B


The possible categories for qualitative variables are often coded for the purpose of performing 
computerized statistical analysis. Marital status might be coded as 1, 2, 3, or 4, where 1 represents 
single, 2 represents married, 3 represents divorced, and 4 represents separated. The variable gender 
might be coded as 0 for female and 1 for male. The categories for any qualitative variable may be 
coded in a similar fashion. Even though numerical values are associated with the characteristic of 
interest after being coded, the variable is considered a qualitative variable. 
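

As a rough sketch of how such coding might look in practice, the snippet below applies the marital status codes given above to a few made-up observations and then counts how often each category occurs. The data and variable names are hypothetical.

```python
from collections import Counter

# Coding scheme from the text: 1 = single, 2 = married, 3 = divorced, 4 = separated.
marital_codes = {"single": 1, "married": 2, "divorced": 3, "separated": 4}
labels = {code: status for status, code in marital_codes.items()}

# Hypothetical observations (made-up data for illustration only).
observed = ["married", "single", "married", "separated", "divorced", "married"]
coded = [marital_codes[status] for status in observed]

# Counting how often each code occurs is a sensible summary of a coded
# qualitative variable; the codes are labels, not quantities.
counts = Counter(coded)
for code in sorted(counts):
    print(code, labels[code], counts[code])
```

The codes serve only as labels. Counting categories is meaningful, whereas averaging the codes would not describe anything about marital status, which is why the variable remains qualitative.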


NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT 


There are four levels of measurement or scales of measurements into which data can be 
classified. The nominal scale applies to data that are used for category identification. The nominal 
level of measurement is characterized by data that consist of names, labels, or categories only. 
Nominal scale data cannot be arranged in an ordering scheme. The arithmetic operations of addition, 
subtraction, multiplication, and division are not performed for nominal data. 
EXAMPLE 1.12 Table 1.6 gives several qualitative variables and a set of possible nominal level data values. 
The data values are often encoded for recording in a computer data file. Blood type might be recorded as 1, 2, 3, 
or 4; state of residence might be recorded as 1, 2, ..., or 50; and type of crime might be recorded as 0 or 1, or 1 
or 2, etc. Similarly, color of road sign could be recorded as 1, 2, 3, 4, or 5 and religion could be recorded as 1, 
2, or 3. There is no order associated with these data and arithmetic operations are not performed. For example, 
adding Christian and Moslem (1 + 2) does not give other (3). 


Table 1.6 


                                                Possible nominal level data values 
Qualitative variable                            associated with the variable

Blood type                                      A, B, AB, O
State of residence                              Alabama, ..., Wyoming
Type of crime                                   Misdemeanor, felony
Color of road signs in the state of Nebraska    Red, white, blue, brown, green
Religion                                        Christian, Moslem, other


The ordinal scale applies to data that can be arranged in some order, but differences between 
data values either cannot be determined or are meaningless. The ordinal level of measurement is 
characterized by data that applies to categories that can be ranked. Ordinal scale data can be 
arranged in an ordering scheme. 


EXAMPLE 1.13 Table 1.7 gives several qualitative variables and a set of possible ordinal level data values. 
The data values for ordinal level data are often encoded for inclusion in computer data files. Arithmetic 
operations are not performed on ordinal level data, but an ordering scheme exists. A full-size automobile is 
larger than a subcompact, a tire rated excellent is better than one rated poor, no pain is preferable to any level 
of pain, the level of play in major league baseball is better than the level of play in class AA, and so forth. 


Table 1.7 


                                 Possible ordinal level data values associated 
Qualitative variable             with the variable

Automobile size description      Subcompact, compact, intermediate, full-size
Product rating                   Poor, good, excellent
Socioeconomic class              Lower, middle, upper
Pain level                       None, low, moderate, severe
Baseball team classification     Class A, class AA, class AAA, major league


The interval scale applies to data that can be arranged in some order and for which differences in 
data values are meaningful. The interval level of measurement results from counting or measuring. 
Interval scale data can be arranged in an ordering scheme and differences can be calculated and 
interpreted. The value zero is arbitrarily chosen for interval data and does not imply an absence of 
the characteristic being measured. Ratios are not meaningful for interval data. 


EXAMPLE 1.14 Stanford-Binet IQ scores represent interval level data. Joe’s IQ score equals 100 and John’s 
IQ score equals 150. John has a higher IQ than Joe; that is, IQ scores can be arranged in order. John’s IQ score 
is 50 points higher than Joe’s IQ score; that is, differences can be calculated and interpreted. However, we 
cannot conclude that John is 1.5 times (150/100 = 1.5) more intelligent than Joe. An IQ score of zero does not 
indicate a complete lack of intelligence. 
EXAMPLE 1.15 Temperatures represent interval level data. The high temperature on February 1 equaled 25°F 
and the high temperature on March 1 equaled 50°F. It was warmer on March 1 than it was on February 1. That 
is, temperatures can be arranged in order. It was 25° warmer on March 1 than on February 1. That is, differences 
may be calculated and interpreted. We cannot conclude that it was twice as warm on March 1 as it was on 
February 1. That is, ratios are not readily interpretable. A temperature of 0°F does not indicate an absence of 
warmth. 
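

One way to see why ratios are not interpretable for the temperatures in Example 1.15 is to express the same two readings on the Celsius scale. The conversion C = (F - 32) × 5/9 is standard but does not appear in the example; the sketch below is offered only as an illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

feb_f, mar_f = 25.0, 50.0                  # high temperatures from Example 1.15
feb_c = fahrenheit_to_celsius(feb_f)       # about -3.9 degrees Celsius
mar_c = fahrenheit_to_celsius(mar_f)       # 10 degrees Celsius

# The ratio of the two readings depends on which scale is used ...
ratio_f = mar_f / feb_f
ratio_c = mar_c / feb_c

# ... whereas the difference changes only by the fixed factor 5/9.
diff_f = mar_f - feb_f
diff_c = mar_c - feb_c

print(ratio_f, round(ratio_c, 2))
print(diff_f, round(diff_c, 2))
```

The ratio equals 2 in Fahrenheit but is negative in Celsius, so "twice as warm" has no scale-free meaning. The difference, by contrast, is simply rescaled by 5/9, which is why differences remain interpretable for interval data.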


EXAMPLE 1.16 Test scores represent interval level data. Lana scored 80 on a test and Christine scored 40 on 
a test. Lana scored higher than Christine did on the test; that is, the test scores can be arranged in order. Lana 
scored 40 points higher than Christine did on the test; that is, differences can be calculated and interpreted. We 
cannot conclude that Lana knows twice as much as Christine about the subject matter. A test score of 0 does not 
indicate an absence of knowledge concerning the subject matter. 


The ratio scale applies to data that can be ranked and for which all arithmetic operations 
including division can be performed. Division by zero is, of course, excluded. The ratio level of 
measurement results from counting or measuring. Ratio scale data can be arranged in an ordering 
scheme and differences and ratios can be calculated and interpreted. Ratio level data has an absolute 
zero and a value of zero indicates a complete absence of the characteristic of interest. 


EXAMPLE 1.17 The grams of fat consumed per day for adults in the United States is ratio scale data. Joe 
consumes 50 grams of fat per day and John consumes 25 grams per day. Joe consumes twice as much fat as 
John per day, since 50/25 = 2. For an individual who consumes 0 grams of fat on a given day, there is a 
complete absence of fat consumed on that day. Notice that a ratio is interpretable and an absolute zero exists. 


EXAMPLE 1.18 The number of 911 emergency calls in a sample of 50 such calls selected from a 24-hour 
period involving a domestic disturbance is ratio scale data. The number found on May 1 equals 5 and the 
number found on June 1 equals 10. Since 10/5 = 2, we say that twice as many were found on June 1 as were 
found on May 1. For a 24-hour period in which no domestic disturbance calls were found, there is a complete 
absence of such calls. Notice that a ratio is interpretable and an absolute zero exists. 


SUMMATION NOTATION 


Many of the statistical measures discussed in the following chapters involve sums of various 
types. Suppose the numbers of 911 emergency calls received on four days were 411, 375, 400, and 
478. If we let x represent the number of calls received per day, then the values of the variable for the 
four days are represented as follows: $x_1 = 411$, $x_2 = 375$, $x_3 = 400$, and $x_4 = 478$. The sum of calls for 
the four days is represented as $x_1 + x_2 + x_3 + x_4$, which equals 411 + 375 + 400 + 478 or 1664. The 
symbol $\sum x$, read as “the summation of x,” is used to represent $x_1 + x_2 + x_3 + x_4$. The uppercase Greek 
letter $\Sigma$ (pronounced sigma) corresponds to the English letter S and stands for the phrase “the sum 
of.” Using the summation notation, the total number of 911 calls for the four days would be written 
as $\sum x = 1664$. 


EXAMPLE 1.19 The following five values were observed for the variable x: $x_1 = 4$, $x_2 = 5$, $x_3 = 0$, $x_4 = 6$, and 
$x_5 = 10$. The following computations illustrate the usage of the summation notation. 
$\sum x = x_1 + x_2 + x_3 + x_4 + x_5 = 4 + 5 + 0 + 6 + 10 = 25$
$(\sum x)^2 = (x_1 + x_2 + x_3 + x_4 + x_5)^2 = (25)^2 = 625$
$\sum x^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 4^2 + 5^2 + 0^2 + 6^2 + 10^2 = 177$
$\sum (x - 5) = (x_1 - 5) + (x_2 - 5) + (x_3 - 5) + (x_4 - 5) + (x_5 - 5)$
$\sum (x - 5) = (4 - 5) + (5 - 5) + (0 - 5) + (6 - 5) + (10 - 5) = -1 + 0 - 5 + 1 + 5 = 0$
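

The computations in Example 1.19 can be checked with a few lines of Python. This is merely a sketch, with illustrative variable names, of how each summation translates into code; the outline itself relies on MINITAB rather than Python.

```python
x = [4, 5, 0, 6, 10]   # the five observed values from Example 1.19

sum_x = sum(x)                                # sum of x
square_of_sum = sum(x) ** 2                   # (sum of x) squared
sum_of_squares = sum(v ** 2 for v in x)       # sum of x squared
sum_of_deviations = sum(v - 5 for v in x)     # sum of (x - 5)

print(sum_x, square_of_sum, sum_of_squares, sum_of_deviations)
```

Note that the square of the sum (625) differs from the sum of the squares (177); the position of the exponent relative to the summation sign matters.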
EXAMPLE 1.20 The following values were observed for the variables x and y: $x_1 = 1$, $x_2 = 2$, $x_3 = 0$, $x_4 = 4$, 
$y_1 = 2$, $y_2 = 1$, $y_3 = 4$, and $y_4 = 5$. The following computations show how the summation notation is used for two 
variables. 
$\sum xy = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4 = 1 \times 2 + 2 \times 1 + 0 \times 4 + 4 \times 5 = 24$
$(\sum x)(\sum y) = (x_1 + x_2 + x_3 + x_4)(y_1 + y_2 + y_3 + y_4) = (1 + 2 + 0 + 4)(2 + 1 + 4 + 5) = 7 \times 12 = 84$
$(\sum x^2 - (\sum x)^2/4) \times (\sum y^2 - (\sum y)^2/4) = (1 + 4 + 0 + 16 - 7^2/4) \times (4 + 1 + 16 + 25 - 12^2/4) = (8.75) \times (10) = 87.5$
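

The two-variable computations in Example 1.20 can be verified in the same way. The sketch below pairs the x and y values by position; the names sxx and syy are hypothetical labels for the two bracketed quantities in the last line of the example.

```python
x = [1, 2, 0, 4]
y = [2, 1, 4, 5]
n = len(x)   # four paired observations

# Sum of the products x*y, pairing values with the same subscript.
sum_xy = sum(a * b for a, b in zip(x, y))

# Product of the two separate sums.
product_of_sums = sum(x) * sum(y)

# The two bracketed quantities from the last computation in the example.
sxx = sum(a ** 2 for a in x) - sum(x) ** 2 / n
syy = sum(b ** 2 for b in y) - sum(y) ** 2 / n

print(sum_xy, product_of_sums, sxx * syy)
```

The sum of products (24) and the product of sums (84) are different quantities, which the code makes explicit by computing them separately.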


COMPUTERS AND STATISTICS 


The techniques of descriptive and inferential statistics involve lengthy repetitive computations as 
well as the construction of various graphical constructs. These computations and graphical 
constructions have been simplified by the development of computer software. These computer 
software programs are referred to as statistical software packages, or simply statistical packages. 
These statistical packages are large computer programs which perform the various computations and 
graphical constructions discussed in this outline plus many other ones beyond the scope of the 
outline. Statistical packages are currently available for use on mainframes, minicomputers, and 
microcomputers. 

Numerous statistical packages are currently available. Four widely used statistical 
packages are: MINITAB, BMDP, SPSS, and SAS. Many of the figures found in the following 
chapters are MINITAB generated. MINITAB is a registered trademark of Minitab, Inc., 3081 
Enterprise Drive, State College, PA 16801. Phone: 814-238-3280; fax: 814-238-4383; telex: 881612. 
The author would like to thank Minitab Inc. for their permission to use output from MINITAB 
throughout the outline. 


Solved Problems 


DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS: 
POPULATION AND SAMPLE 


1.1. Classify each of the following as descriptive statistics or inferential statistics. 


(a) The average points per game, percent of free throws made, average number of rebounds 
per game, and average number of fouls per game as well as several other measures 
for players in the NBA are computed. 

(b) Ten percent of the boxes of cereal sampled by a quality technician are found to be under 
the labeled weight. Based on this finding, the filling machine is adjusted to increase the 
amount of fill. 

(c) USA Today gives several pages of numerical quantities concerning stocks listed in AMEX, 
NASDAQ, and NYSE as well as mutual funds listed in MUTUALS. 

(d) Based on a study of 500 single parent households by a social researcher, a magazine 
reports that 25% of all single parent households are headed by a high school dropout. 


Ans. (a) The measurements given organize and summarize information concerning the players and are 
therefore considered descriptive statistics. 

(b) Because of the high percent of boxes of cereal which are under the labeled weight in the sample, a 
decision is made to increase the weight per box for each box in the population. This is an example of 
inferential statistics. 

(c) The tables of measurements such as stock prices and change in stock prices are descriptive in nature 
and therefore represent descriptive statistics. 

(d) The magazine is stating a conclusion about the population based upon a sample and therefore this is 
an example of inferential statistics. 


1.2 Identify the sample and the population in each of the following scenarios. 


(a) In order to study the response times for emergency 911 calls in Chicago, fifty “robbery in 
progress” calls are selected randomly over a six-month period and the response times are 
recorded. 

(b) In order to study a new medical charting system at Saint Anthony’s Hospital, a 
representative group of nurses is asked to use the charting system. Recording times and 
error rates are recorded for the group. 

(c) Fifteen hundred individuals who listen to talk radio programs of various types are selected 
and information concerning their education level, income level, and so forth is recorded. 


Ans. (a) The 50 “robbery in progress” calls are the sample, and all “robbery in progress” calls in Chicago 
during the six-month period are the population. 

(b) The representative group of nurses who use the medical charting system is the sample and all nurses 
who use the medical charting system at Saint Anthony’s are the population. 

(c) The 1500 selected individuals who listen to talk radio programs are the sample and the millions who 
listen nationally are the population. 

VARIABLE, OBSERVATION, AND DATA SET 


1.3 In a sociological study involving 35 low-income households, the number of children per 
household was recorded for each household. What is the variable? How many observations are 
in the data set? 


Ans. The variable is the number of children per household. The data set contains 35 observations. 


1.4 A national survey was mailed to 5000 households and one question asked for the number of 
handguns per household. Three thousand of the surveys were completed and returned. What is 
the variable and how large is the data set? 


Ans. The variable is the number of handguns per household and there are 3000 observations in the data 
set. 


1.5 The number of hours spent per week on paper work was determined for 200 middle level 
managers. The minimum was 0 hours and the maximum was 27 hours. What is the variable? 
How many observations are in the data set? 


Ans. The variable is the number of hours spent per week on paper work and the number of observations 
equals 200. 


QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE 


1.6 Classify the variables in problems 1.3, 1.4, and 1.5 as continuous or discrete. 


Ans. The number of children per household is a discrete variable since the number of values this 
variable may assume is countable. The values range from 0 to some maximum value such as 10 or 
15 depending upon the population. 


The number of handguns per household is countable, ranging from 0 to some maximum value and 
therefore this variable is discrete. 


The time spent per week on paper work by middle level managers may be any real number 
between 0 and some upper limit. The number of values possible is not countable and therefore this 
variable is continuous. 


1.7 A program to locate drunk drivers is initiated and roadblocks are used to check for individuals 
driving under the influence of alcohol. Let n represent the number of drivers stopped before the 
first drunk driver is found. What are the possible values for n? Classify n as discrete or 
continuous. 


Ans. The number of drivers stopped before finding the first drunk driver may equal 1, 2, 3, ..., up to 
an infinitely large number. Although not likely, it is theoretically possible that an extremely large 
number of drivers would need to be checked before finding the first drunk driver. The possible 
values for n are all the positive integers. The variable n is discrete. 


1.8 The KSW computer science aptitude test consists of 25 questions. The score reported is 
reflective of the computer science aptitude of the test taker. How would the score likely be 
reported for the test? What are the possible values for the scores? Is the variable discrete or 
continuous? 


Ans. The score reported would likely be the number or percent of correct answers. The number correct 
would be a whole number from 0 to 25 and the percent correct would range from 0 to 100 in steps 
of size 4. However, if the test evaluator considered the reasoning process used to arrive at the 
answers and assigned partial credit for each problem, the scores could range from 0 to 25 or 0 to 
100 percent continuously. That is, the score could be any real number between 0 and 25 or any 
real number between 0 and 100 percent. We might say that for all practical purposes, the variable 
is discrete. However, theoretically the variable is continuous. 
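

The "steps of size 4" in the answer to problem 1.8 can be seen by listing every possible percent score when no partial credit is given. The short sketch below does this; the variable names are illustrative.

```python
questions = 25

# Possible numbers of correct answers: 0, 1, ..., 25.
possible_counts = list(range(questions + 1))

# Percent correct for each count; the division is exact because 25 divides 100.
possible_percents = [100 * k // questions for k in possible_counts]

# Gaps between consecutive percent scores.
steps = {b - a for a, b in zip(possible_percents, possible_percents[1:])}

print(possible_percents)
print(len(possible_percents), steps)
```

There are 26 possible scores, each separated from the next by 4 percentage points, so without partial credit the variable takes only a finite, countable set of values.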


QUALITATIVE VARIABLE 


1.9 Which of the following are qualitative variables? 


(a) The color of automobiles involved in several severe accidents 

(b) The length of time required for rats to move through a maze 

(c) The classification of police administrations as city, county, or state 

(d) The rating given to a pizza in a taste test as poor, good, or excellent 

(e) The number of times subjects in a sociological research study have been married 


Ans. The variables given in (a), (c), and (d) are qualitative variables since they result in nonnumerical 
values. They are classified into categories. The variables given in (b) and (e) result in numerical 
values as a result of measuring and counting, respectively. 


1.10 The pain level following surgery for an intestinal blockage was classified as none, low, 
moderate, or severe for several patients. Give three different numerical coding schemes that 
might be used for the purpose of inclusion of the responses in a computer data file. Does this 
coding change the variable to a quantitative variable? 


Ans. The responses none, low, moderate, or severe might be coded as 0, 1, 2, or 3; as 1, 2, 3, or 4; or as 
10, 20, 30, or 40. There is no limit to the number of coding schemes that could be used. Coding 
the variable does not change it into a quantitative variable. Many times coding a qualitative 
variable simplifies the computer analysis performed on the variable. 
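
A minimal sketch of the three coding schemes follows. The list of responses is made up for illustration, and the names scheme_a, scheme_b, and scheme_c are hypothetical.

```python
# Three numerical coding schemes for the same qualitative variable (pain level).
responses = ["low", "severe", "none", "moderate", "low"]  # made-up sample responses

scheme_a = {"none": 0, "low": 1, "moderate": 2, "severe": 3}
scheme_b = {"none": 1, "low": 2, "moderate": 3, "severe": 4}
scheme_c = {"none": 10, "low": 20, "moderate": 30, "severe": 40}

coded_a = [scheme_a[r] for r in responses]
coded_b = [scheme_b[r] for r in responses]
coded_c = [scheme_c[r] for r in responses]

# Decoding recovers the original categories, so the codes are only labels.
decode_a = {code: label for label, code in scheme_a.items()}
decoded = [decode_a[c] for c in coded_a]

print(coded_a, coded_b, coded_c)
print(decoded == responses)
```

Because any of the schemes can be inverted back to the same categories, the codes carry no more information than the labels themselves, which is why the variable remains qualitative.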


NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT 


1.11 Indicate the scale of measurement for each of the following variables: racial origin, monthly 
phone bills, Fahrenheit and centigrade temperature scales, military ranks, time, ranking of a 
personality trait, clinical diagnoses, and calendar numbering of the years. 


Ans. racial origin: nominal                 time: ratio 
     monthly phone bills: ratio             ranking of personality trait: ordinal 
     temperature scales: interval           clinical diagnoses: nominal 
     military ranks: ordinal                calendar numbering of the years: interval 


1.12 Which scales of measurement would usually apply to qualitative data? 


Ans. nominal or ordinal 


SUMMATION NOTATION 


1.13 The following values are recorded for the variable x: $x_1 = 1.3$, $x_2 = 2.5$, $x_3 = 0.7$, $x_4 = 3.5$. 
Evaluate the following summations: $\sum x$, $\sum x^2$, $(\sum x)^2$, and $\sum (x - .5)$. 

Ans. $\sum x = x_1 + x_2 + x_3 + x_4 = 1.3 + 2.5 + 0.7 + 3.5 = 8.0$ 
$\sum x^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 1.3^2 + 2.5^2 + 0.7^2 + 3.5^2 = 20.68$ 
$(\sum x)^2 = (8.0)^2 = 64.0$ 
$\sum (x - .5) = (x_1 - .5) + (x_2 - .5) + (x_3 - .5) + (x_4 - .5) = 0.8 + 2.0 + 0.2 + 3.0 = 6.0$ 
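
The four summations in problem 1.13 can be checked with a few lines of code. This is only a sketch; the variable names are illustrative, and results are rounded to two decimals to hide floating-point round-off.

```python
# Summation notation for the data of problem 1.13.
x = [1.3, 2.5, 0.7, 3.5]

sum_x = sum(x)                               # sum of the values
sum_x_sq = sum(v ** 2 for v in x)            # sum of the squared values
sq_sum_x = sum(x) ** 2                       # square of the sum
sum_x_minus_half = sum(v - 0.5 for v in x)   # sum of (x - .5)

print(round(sum_x, 2), round(sum_x_sq, 2), round(sq_sum_x, 2), round(sum_x_minus_half, 2))
```

Note that the sum of the squares and the square of the sum are different quantities, as the two middle results show.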


1.14 The following values are recorded for the variables x and y: $x_1 = 25.67$, $x_2 = 10.95$, $x_3 = 5.65$, 
$y_1 = 3.45$, $y_2 = 1.55$, and $y_3 = 3.50$. Evaluate the following summations: $\sum xy$, $\sum x^2y^2$, and 
$\sum xy - \sum x \sum y$. 

Ans. $\sum xy = x_1y_1 + x_2y_2 + x_3y_3 = 25.67 \times 3.45 + 10.95 \times 1.55 + 5.65 \times 3.50 = 125.31$ 
$\sum x^2y^2 = x_1^2y_1^2 + x_2^2y_2^2 + x_3^2y_3^2 = 25.67^2 \times 3.45^2 + 10.95^2 \times 1.55^2 + 5.65^2 \times 3.50^2 = 8522.26$ 
$\sum xy - \sum x \sum y = 125.31 - 42.27 \times 8.50 = -233.99$ 
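
For paired data, the sum of products and the product of sums are easy to confuse. The following sketch, using the values of problem 1.14 and illustrative variable names, computes each term separately.

```python
# Paired summations for the data of problem 1.14.
x = [25.67, 10.95, 5.65]
y = [3.45, 1.55, 3.50]

sum_xy = sum(a * b for a, b in zip(x, y))            # sum of the products x*y
sum_x2y2 = sum((a * b) ** 2 for a, b in zip(x, y))   # sum of the squared products
product_of_sums = sum(x) * sum(y)                    # (sum of x) times (sum of y)
difference = sum_xy - product_of_sums

print(round(sum_xy, 2), round(sum_x2y2, 2), round(difference, 2))
```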


1.15 The sum of four values for the variable y equals 25, that is, $\sum y = 25$. If it is known that $y_1 = 2$, 
$y_2 = 7$, and $y_3 = 6$, find $y_4$. 


Ans. $\sum y = 25 = 2 + 7 + 6 + y_4$, or $25 = 15 + y_4$. From this, we see that $y_4$ must equal 10. 


Supplementary Problems 


DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS: POPULATION AND SAMPLE 


1.16 Classify each of the following as descriptive statistics or inferential statistics. 


(a) The Nielsen Report on Television utilizes data from a sample of viewers to give estimates of average 
viewing time per week per viewer for all television viewers. 

(b) The U.S. National Center for Health Statistics publication entitled Vital Statistics of the United 
States lists the leading causes of death in a given year. The estimates are based upon a sampling of 
death certificates. 

(c) The Omaha World Herald lists the low and high temperatures for several American cities. 

(d) The number of votes a presidential candidate receives are given for each state following the 
presidential election. 

(e) The National Household Survey on Drug Abuse gives the current percentage of young adults using 
different types of drugs. The percentages are based upon national samples. 


Ans. (a) inferential statistics (b) inferential statistics (c) descriptive statistics 
(d) descriptive statistics (e) inferential statistics 


1.17 Classify each of the following as a sample or a population. 


(a) all diabetics in the United States 

(b) a group of 374 individuals selected for a New York Times/CBS news poll 

(c) all owners of Ford trucks 

(d) all registered voters in the state of Arkansas 

(e) a group of 22,000 physicians who participate in a study to determine the role of aspirin in preventing 
heart attacks 


Ans. (a) population (b) sample (c) population (d) population (e) sample 


VARIABLE, OBSERVATION, AND DATA SET 


1.18 Changes in systolic blood pressure readings were recorded for 325 hypertensive patients who were 
participating in a study involving a new medication to control hypertension. Larry Doe is a patient in the 
study and he experienced a drop of 15 units in his systolic blood pressure. What statistical term is used to 
describe the change in systolic blood pressure readings? What does the number 325 represent? What term 
is used for the 15-unit drop in systolic blood pressure? 


Ans. The change in blood pressure is the variable, 325 is the number of observations in the data set, and 
the 15-unit drop in blood pressure is an observation. 


1.19 Table 1.8 gives the fasting blood sugar reading for five patients at a small medical clinic. What is the 
variable? Give the observations that comprise this data set. 


Table 1.8 


Fasting blood sugar reading 


Sam Alcorn 
Susan Collins 
Larry Halsey 
Bill Samuels 
Lana Williams 


Ans. The variable is the fasting blood sugar reading for a patient. The observations are 135, 157, 168, 
120, and 160. 


1.20 A sociological study involving a minority group recorded the educational level of the participants. The 
educational level was coded as follows: less than high school was coded as 1, high school was coded as 2, 
college graduate was coded as 3, and postgraduate was coded as 4. The results were: 


What is the variable? How many observations are in the data set? 


Ans. The variable is the educational level of a participant. There are 46 observations in the data set. 


QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE 
1.21 Classify the variables in problems 1.18, 1.19, and 1.20 as discrete or continuous. 


Ans. The variables in problems 1.18 and 1.19 are continuous. The variable in problem 1.20 is not a 
quantitative variable. 


1.22 A die is tossed until the face 6 turns up on a toss. The variable x equals the toss upon which the face 6 
first appears. What are the possible values that x may assume? Is x discrete or continuous? 


Ans. x may equal any positive integer, and it is therefore a discrete variable. 
1.23 Is it possible for a variable to be both discrete and continuous? 


Ans. no 


QUALITATIVE VARIABLE 
1.24 Give five examples of a qualitative variable. 


Ans. 1. Classification of government employees 
     2. Motion picture ratings 
     3. College student classification 
     4. Medical specialty of doctors 
     5. ZIP code 


1.25 Which of the following is not a qualitative variable? hair color, eye color, make of computer, personality 
type, and percent of income spent on food 


Ans. percent of income spent on food 


NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT 


1.26 Indicate the scale of measurement for each of the following variables: religion classification; movie 
ratings of 1, 2, 3, or 4 stars; body temperature; weights of runners, and consumer product ratings given 
as poor, average, or excellent. 


Ans. religion: nominal                      weights of runners: ratio 
     movie ratings: ordinal                 consumer product ratings: ordinal 
     body temperature: interval 


1.27 Which scales of measurement would usually apply to quantitative data? 


Ans. interval or ratio 


SUMMATION NOTATION 


1.28 The following values are recorded for the variable x: $x_1 = 15$, $x_2 = 25$, $x_3 = 10$, and $x_4 = 5$. Evaluate the 
following summations: $\sum x$, $\sum x^2$, $(\sum x)^2$, and $\sum (x - 5)$. 


Ans. $\sum x = 55$, $\sum x^2 = 975$, $(\sum x)^2 = 3025$, $\sum (x - 5) = 35$ 


1.29 The following values are recorded for the variables x and y: $x_1 = 17$, $x_2 = 28$, $x_3 = 35$, $y_1 = 20$, $y_2 = 30$, 
and $y_3 = 40$. Evaluate the following summations: $\sum xy$, $\sum x^2y^2$, and $\sum xy - \sum x \sum y$. 


Ans. $\sum xy = 2580$, $\sum x^2y^2 = 2{,}781{,}200$, $\sum xy - \sum x \sum y = 2580 - 80 \times 90 = -4{,}620$ 


1.30 Given that $x_1 = 5$, $x_2 = 10$, $y_1 = 20$, and $\sum xy = 200$, find $y_2$. 


Ans. $y_2 = 10$ 


Chapter 2 


Organizing Data 


RAW DATA 


Information obtained by observing values of a variable is called raw data. Data obtained by 
observing values of a qualitative variable are referred to as qualitative data. Data obtained by 
observing values of a quantitative variable are referred to as quantitative data. Quantitative data 
obtained from a discrete variable are also referred to as discrete data and quantitative data obtained 
from a continuous variable are called continuous data. 


EXAMPLE 2.1 A study is conducted in which individuals are classified into one of sixteen personality types 
using the Myers-Briggs type indicator. The resulting raw data would be classified as qualitative data. 


EXAMPLE 2.2 The cardiac output in liters per minute is measured for the participants in a medical study. The 
resulting data would be classified as quantitative data and continuous data. 


EXAMPLE 2.3 The number of murders per 100,000 inhabitants is recorded for each of several large cities for 
the year 1994. The resulting data would be classified as quantitative data and discrete data. 


FREQUENCY DISTRIBUTION FOR QUALITATIVE DATA 


A frequency distribution for qualitative data lists all categories and the number of elements that 
belong to each of the categories. 


EXAMPLE 2.4 A sample of rural county arrests gave the following set of offenses with which individuals were 
charged: 

rape robbery burglary arson murder robbery rape manslaughter 
arson theft arson burglary theft robbery theft theft 
theft burglary murder murder theft theft theft manslaughter 
manslaughter 


The variable, type of offense, is classified into the categories: rape, robbery, burglary, arson, murder, theft, and 
manslaughter. As shown in Table 2.1, the seven categories are listed under the column entitled Offense, and each 
occurrence of a category is recorded by using the symbol / in order to tally the number of times each offense 
occurs. The number of tallies for each offense is counted and listed under the column entitled Frequency. 
Occasionally the term absolute frequency is used rather than frequency. 


Table 2.1 

Offense         Tally       Frequency 
Rape            //          2 
Robbery         ///         3 
Burglary        ///         3 
Arson           ///         3 
Murder          ///         3 
Theft           ////////    8 
Manslaughter    ///         3 


RELATIVE FREQUENCY OF A CATEGORY 


The relative frequency of a category is obtained by dividing the frequency for a category by the 
sum of all the frequencies. The relative frequencies for the seven categories in Table 2.1 are shown in 
Table 2.2. The sum of the relative frequencies will always equal one. 


PERCENTAGE 


The percentage for a category is obtained by multiplying the relative frequency for that category 
by 100. The percentages for the seven categories in Table 2.1 are shown in Table 2.2. The sum of the 
percentages for all the categories will always equal 100 percent. 


Table 2.2 

Offense         Relative frequency    Percentage 
Rape            2/25 = .08            .08 x 100 = 8% 
Robbery         3/25 = .12            .12 x 100 = 12% 
Burglary        3/25 = .12            .12 x 100 = 12% 
Arson           3/25 = .12            .12 x 100 = 12% 
Murder          3/25 = .12            .12 x 100 = 12% 
Theft           8/25 = .32            .32 x 100 = 32% 
Manslaughter    3/25 = .12            .12 x 100 = 12% 
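
The frequencies, relative frequencies, and percentages for the offenses of Example 2.4 can be reproduced with a short sketch. The offense list is transcribed from the example; the variable names are illustrative.

```python
from collections import Counter

# Offenses from Example 2.4, in the order listed.
offenses = [
    "rape", "robbery", "burglary", "arson", "murder", "robbery", "rape", "manslaughter",
    "arson", "theft", "arson", "burglary", "theft", "robbery", "theft", "theft",
    "theft", "burglary", "murder", "murder", "theft", "theft", "theft", "manslaughter",
    "manslaughter",
]

frequency = Counter(offenses)                  # category -> number of occurrences
total = sum(frequency.values())                # sum of all the frequencies
relative_frequency = {k: f / total for k, f in frequency.items()}
percentage = {k: 100 * rf for k, rf in relative_frequency.items()}

for offense, f in frequency.items():
    print(f"{offense:<13}{f:>3}{relative_frequency[offense]:>7.2f}{percentage[offense]:>6.0f}%")
```

The relative frequencies sum to one and the percentages sum to 100, apart from floating-point round-off.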


BAR GRAPH 


A bar graph is a graph composed of bars whose heights are the frequencies of the different 
categories. A bar graph displays graphically the same information concerning qualitative data that a 
frequency distribution shows in tabular form. 


EXAMPLE 2.5 The distribution of the primary sites for cancer is given in Table 2.3 for the residents of Dalton 
County. 


Table 2.3 

Primary site        Frequency 
Digestive system    20 
Respiratory         30 
Breast              10 
Genitals            5 
Urinary tract       5 
Other               5 


To construct a bar graph, the categories are placed along the horizontal axis and frequencies are marked along 
the vertical axis. A bar is drawn for each category such that the height of the bar is equal to the frequency for that 
category. A small gap is left between the bars. The bar graph for Table 2.3 is shown in Fig. 2-1. Bar graphs can 
also be constructed by placing the categories along the vertical axis and the frequencies along the horizontal axis. 
See problem 2.5 for a bar graph of this type. 


[Bar graph: frequency (vertical axis, marked at 10, 20, and 30) versus primary site (horizontal axis: digestive system, respiratory system, breast, genitals, urinary tract, other)] 

Fig. 2-1 


PIE CHART 


A pie chart is also used to graphically display qualitative data. To construct a pie chart, a circle is 
divided into portions that represent the relative frequencies or percentages belonging to different 
categories. 


EXAMPLE 2.6 To construct a pie chart for the frequency distribution in Table 2.3, construct a table that gives 
angle sizes for each category. Table 2.4 shows the determination of the angle sizes for each of the categories in 
Table 2.3. The 360° in a circle are divided into portions that are proportional to the category sizes. The pie chart 
for the frequency distribution in Table 2.3 is shown in Fig. 2-2. 


Table 2.4 

Primary site        Angle size 
Digestive system    360 x .26 = 93.6° 
Respiratory         360 x .40 = 144° 
Breast              360 x .13 = 46.8° 
Genitals            360 x .07 = 25.2° 
Urinary tract       360 x .07 = 25.2° 
Other               360 x .07 = 25.2° 
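
The angle for each sector is 360° times the relative frequency of the category. The sketch below assumes frequencies of 20, 30, 10, 5, 5, and 5 for the six primary sites; the last three values are inferred from the 6.7% sectors in Fig. 2-2 and should be treated as an assumption. Because it uses unrounded relative frequencies, the angles differ slightly from those in Table 2.4, which are based on relative frequencies rounded to two decimals.

```python
# Sector angles for a pie chart: angle = 360 * (frequency / total).
frequencies = {
    "Digestive system": 20,
    "Respiratory": 30,
    "Breast": 10,
    "Genitals": 5,        # assumed from the 6.7% sector
    "Urinary tract": 5,   # assumed from the 6.7% sector
    "Other": 5,           # assumed from the 6.7% sector
}

total = sum(frequencies.values())
angles = {site: 360 * f / total for site, f in frequencies.items()}

for site, angle in angles.items():
    print(f"{site:<18}{angle:6.1f} degrees")
print("sum of angles:", sum(angles.values()))
```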


[Pie chart of primary cancer sites: Digestive system (26.7%), Respiratory system (40.0%), Breast (13.3%), Genitals (6.7%), Urinary tract (6.7%), Other (6.7%)] 

Fig. 2-2 


FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA 


There are many similarities between frequency distributions for qualitative data and frequency 
distributions for quantitative data. Terminology for frequency distributions of quantitative data is 
discussed first, and then examples illustrating the construction of frequency distributions for 
quantitative data are given. Table 2.5 gives a frequency distribution of the Stanford-Binet intelligence 
test scores for 75 adults. 


Table 2.5 

IQ score     Frequency 
80-94        8 
95-109       14 
110-124      24 
125-139      16 
140-154      13 


IQ score is a quantitative variable and according to Table 2.5, eight of the individuals have an IQ 
score between 80 and 94, fourteen have scores between 95 and 109, twenty-four have scores between 
110 and 124, sixteen have scores between 125 and 139, and thirteen have scores between 140 and 154. 


CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH 


The frequency distribution given in Table 2.5 is composed of five classes. The classes are: 80-94, 
95-109, 110-124, 125-139, and 140-154. Each class has a lower class limit and an upper class limit. 
The lower class limits for this distribution are 80, 95, 110, 125, and 140. The upper class limits are 94, 
109, 124, 139, and 154. 

If the lower class limit for the second class, 95, is added to the upper class limit for the first class, 
94, and the sum divided by 2, the upper boundary for the first class and the lower boundary for the 
second class is determined. Table 2.6 gives all the boundaries for Table 2.5. 

If the lower class limit is added to the upper class limit for any class and the sum divided by 2, the 
class mark for that class is obtained. The class mark for a class is the midpoint of the class and is 
sometimes called the class midpoint rather than the class mark. The class marks for Table 2.5 are 
shown in Table 2.6. 

The difference between the boundaries for any class gives the class width for a distribution. The 
class width for the distribution in Table 2.5 is 15. 


Table 2.6 

Class limits    Class boundaries    Class width    Class marks 
80-94           79.5-94.5           15             87.0 
95-109          94.5-109.5          15             102.0 
110-124         109.5-124.5         15             117.0 
125-139         124.5-139.5         15             132.0 
140-154         139.5-154.5         15             147.0 
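
The boundaries, marks, and width follow mechanically from the class limits. The sketch below is one way to compute them for the classes of Table 2.5; it assumes the gap between consecutive classes is the same throughout, so the outermost boundaries are extended by half of that gap. The names limits, gap, boundaries, marks, and widths are illustrative.

```python
# Class boundaries, class marks, and class widths from class limits.
limits = [(80, 94), (95, 109), (110, 124), (125, 139), (140, 154)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]

boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in limits]
marks = [(lo + hi) / 2 for lo, hi in limits]
widths = [upper - lower for lower, upper in boundaries]

for (lo, hi), (lb, ub), mark, width in zip(limits, boundaries, marks, widths):
    print(f"{lo}-{hi}: boundaries {lb}-{ub}, mark {mark}, width {width}")
```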


When forming a frequency distribution, the following general guidelines should be followed: 
1. The number of classes should be between 5 and 15. 
2. Each data value must belong to one, and only one, class. 
3. When possible, all classes should be of equal width. 


EXAMPLE 2.7 Group the following weights into the classes 100 to under 125, 125 to under 150, and so forth: 


111 120 127 129 130 145 145 150 153 155 160 
161 165 167 170 171 174 175 177 179 180 180 
185 185 190 195 195 201 210 220 224 225 230 
245 248 


The weights 111 and 120 are tallied into the class 100 to under 125. The weights 127, 129, 130, 145, and 145 are 
tallied into the class 125 to under 150 and so forth until the frequencies for all classes are found. The frequency 
distribution for these weights is given in Table 2.7. 


Table 2.7 

Weight              Frequency 
100 to under 125    2 
125 to under 150    5 
150 to under 175    10 
175 to under 200    10 
200 to under 225    4 
225 to under 250    4 


When a frequency distribution is given in this form, the class limits and class boundaries may be considered to be 
the same. The class marks are 112.5, 137.5, 162.5, 187.5, 212.5, and 237.5. The class width is 25. 
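
The tallying in Example 2.7 can be sketched in code. The weights are transcribed from the example; the constants START, WIDTH, and N_CLASSES are illustrative names. Integer division assigns each weight to exactly one class, so a value such as 150 falls in the class 150 to under 175, not in 125 to under 150.

```python
# Tally weights into classes of the form "lower to under lower + WIDTH".
weights = [111, 120, 127, 129, 130, 145, 145, 150, 153, 155, 160,
           161, 165, 167, 170, 171, 174, 175, 177, 179, 180, 180,
           185, 185, 190, 195, 195, 201, 210, 220, 224, 225, 230,
           245, 248]

START, WIDTH, N_CLASSES = 100, 25, 6

counts = [0] * N_CLASSES
for w in weights:
    counts[(w - START) // WIDTH] += 1   # index of the class containing w

for i, count in enumerate(counts):
    lower = START + i * WIDTH
    mark = lower + WIDTH / 2            # class mark (midpoint)
    print(f"{lower} to under {lower + WIDTH}: frequency {count}, class mark {mark}")
```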


EXAMPLE 2.8 The price for 500 aspirin tablets is determined for each of twenty randomly selected stores as 
part of a larger consumer study. The prices are as follows: 


2.50 2.95 2.65 3.10 3.15 3.05 3.05 2.60 2.70 2.75 
2.80 2.80 2.85 2.80 3.00 3.00 2.90 2.90 2.85 2.85 


Suppose we wish to group these data into seven classes. Since the maximum price is 3.15 and the minimum price 
is 2.50, the spread in prices is 0.65. Each class should then have a width equal to approximately 1/7 of 0.65 or 
.093. There is a lot of flexibility in choosing the classes while following the guidelines given above. Table 2.8 
shows the results if a class width equal to 0.10 is selected and the first class begins at the minimum price. 


Table 2.8 

Price           Frequency 
2.50 to 2.59    1 
2.60 to 2.69    2 
2.70 to 2.79    2 
2.80 to 2.89    6 
2.90 to 2.99    3 
3.00 to 3.09    4 
3.10 to 3.19    2 
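
The choice of class width and the resulting tally for the aspirin prices can be sketched as follows. Prices are converted to whole cents to avoid floating-point round-off; that conversion and the variable names are choices made for this illustration. The sketch also tallies the same data with classes written as half-open intervals (lower limit to less than the next lower limit) to show that both ways of writing the classes give the same frequencies.

```python
# Class width and frequency tally for the aspirin prices of Example 2.8.
prices = [2.50, 2.95, 2.65, 3.10, 3.15, 3.05, 3.05, 2.60, 2.70, 2.75,
          2.80, 2.80, 2.85, 2.80, 3.00, 3.00, 2.90, 2.90, 2.85, 2.85]

cents = [round(p * 100) for p in prices]   # work in whole cents
spread = max(cents) - min(cents)           # 315 - 250 = 65 cents
approx_width = spread / 7                  # roughly 9.3 cents per class
width = 10                                 # a convenient width near the approximate value

lower_limits = [min(cents) + i * width for i in range(7)]

# Classes written with limits, e.g. 2.50 to 2.59.
by_limits = [sum(lo <= c <= lo + width - 1 for c in cents) for lo in lower_limits]
# Classes written as half-open intervals, e.g. 2.50 to less than 2.60.
half_open = [sum(lo <= c < lo + width for c in cents) for lo in lower_limits]

print(round(approx_width, 1), by_limits, half_open)
```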


The frequency distribution might also be given in a form such as that shown in Table 2.9. The two different 
ways of expressing the classes shown in Tables 2.8 and 2.9 will result in the same frequencies. 


Table 2.9 

Price                     Frequency 
2.50 to less than 2.60    1 
2.60 to less than 2.70    2 
2.70 to less than 2.80    2 
2.80 to less than 2.90    6 
2.90 to less than 3.00    3 
3.00 to less than 3.10    4 
3.10 to less than 3.20    2 


SINGLE-VALUED CLASSES 


If only a few unique values occur in a set of data, the classes are expressed as a single value rather 
than an interval of values. This typically occurs with discrete data but may also occur with continuous 
data because of measurement constraints. 


EXAMPLE 2.9 A quality technician selects 25 bars of soap from the daily production. The weights in ounces of 
the 25 bars are as follows: 


4.75 4.74 4.74 4.77 4.73 4.75 4.76 4.77 
4.72 4.75 4.77 4.74 4.75 4.77 4.72 4.74 
4.75 4.75 4.74 4.76 4.75 4.75 4.74 4.75 
4.77 


Since only six unique values occur, we will use single-valued classes. The weight 4.72 occurs twice, 4.73 occurs 
once, 4.74 occurs six times, 4.75 occurs nine times, 4.76 occurs twice, and 4.77 occurs five times. The frequency 
distribution is shown in Table 2.10. 


Table 2.10 

Weight    Frequency 
4.72      2 
4.73      1 
4.74      6 
4.75      9 
4.76      2 
4.77      5 
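
With single-valued classes, each distinct value is its own class, so the frequency distribution is just a count of occurrences. A minimal sketch using the soap weights of Example 2.9 (variable names are illustrative):

```python
from collections import Counter

# Weights in ounces of the 25 bars of soap in Example 2.9.
weights = [4.75, 4.74, 4.74, 4.77, 4.73, 4.75, 4.76, 4.77,
           4.72, 4.75, 4.77, 4.74, 4.75, 4.77, 4.72, 4.74,
           4.75, 4.75, 4.74, 4.76, 4.75, 4.75, 4.74, 4.75,
           4.77]

frequency = Counter(weights)            # one class per distinct value
distribution = sorted(frequency.items())

for value, count in distribution:
    print(f"{value:.2f}  {count}")
```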


HISTOGRAMS 


A histogram is a graph that displays the classes on the horizontal axis and the frequencies of the 
classes on the vertical axis. The frequency of each class is represented by a vertical bar whose height is 
equal to the frequency of the class. A histogram is similar to a bar graph. However, a histogram utilizes 
classes or intervals and frequencies while a bar graph utilizes categories and frequencies. 


EXAMPLE 2.10 A histogram for the aspirin prices in Table 2.9 is shown in Fig. 2-3. 


[Histogram: frequency (vertical axis) versus price (horizontal axis)] 

Fig. 2-3 


A symmetric histogram is one that can be divided into two pieces such that each is the mirror 
image of the other. One of the most commonly occurring symmetric histograms is shown in Fig. 2-4. 
This type of histogram is often referred to as a mound-shaped histogram or a bell-shaped histogram. A 
symmetric histogram in which each class has the same frequency is called a uniform or rectangular 
histogram. A skewed to the right histogram has a longer tail on the right side. The histogram shown in 
Fig. 2-5 is skewed to the right. A skewed to the left histogram has a longer tail on the left side. The 
histogram shown in Fig. 2-6 is skewed to the left. 


Fig. 2-5 


Fig. 2-6 


CUMULATIVE FREQUENCY DISTRIBUTIONS 


A cumulative frequency distribution gives the total number of values that fall below various class 
boundaries of a frequency distribution. 


EXAMPLE 2.11 Table 2.11 shows the frequency distribution of the contents in milliliters of a sample of 25 one- 
liter bottles of soda. Table 2.12 shows how to construct the cumulative frequency distribution that corresponds to 
the distribution in Table 2.11. 


Table 2.11 

Contents                  Frequency 
970 to less than 990      5 
990 to less than 1010     10 
1010 to less than 1030    5 
1030 to less than 1050    3 
1050 to less than 1070    2 


CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS 


A cumulative relative frequency is obtained by dividing a cumulative frequency by the total 
number of observations in the data set. The cumulative relative frequencies for the frequency 
distribution given in Table 2.11 are shown in Table 2.12. Cumulative percentages are obtained by 
multiplying cumulative relative frequencies by 100. The cumulative percentages for the distribution 
given in Table 2.11 are shown in Table 2.12. 


Table 2.12 

Contents less than    Cumulative frequency    Cumulative relative frequency    Cumulative percentage 
970                   0                       0/25 = 0                         0% 
990                   5                       5/25 = .20                       20% 
1010                  5 + 10 = 15             15/25 = .60                      60% 
1030                  15 + 5 = 20             20/25 = .80                      80% 
1050                  20 + 3 = 23             23/25 = .92                      92% 
1070                  23 + 2 = 25             25/25 = 1.00                     100% 
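
The running totals in Table 2.12 can be sketched with a cumulative sum. The class frequencies 5, 10, 5, 3, and 2 used below are read off the successive additions shown in the table; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and class frequencies for the soda contents (milliliters).
boundaries = [970, 990, 1010, 1030, 1050, 1070]
frequencies = [5, 10, 5, 3, 2]

# Number of values below each boundary: 0 below the first, then running totals.
cumulative = [0] + list(accumulate(frequencies))
total = cumulative[-1]

cumulative_relative = [c / total for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

for b, c, r, p in zip(boundaries, cumulative, cumulative_relative, cumulative_percent):
    print(f"less than {b}: {c:>2}  {r:.2f}  {p:.0f}%")
```

Plotting each cumulative frequency above its boundary gives the points of an ogive.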


OGIVES 


An ogive is a graph in which a point is plotted above each class boundary at a height equal to the 
cumulative frequency corresponding to that boundary. Ogives can also be constructed for a cumulative 
relative frequency distribution as well as a cumulative percentage distribution. 


EXAMPLE 2.12 The ogive corresponding to the cumulative frequency distribution in Table 2.12 is shown in 
Fig. 2-7. 


[Ogive: cumulative frequency (vertical axis, 0 to 25) versus contents (horizontal axis, 970 to 1070)] 

Fig. 2-7 


STEM-AND-LEAF DISPLAYS 


In a stem-and-leaf display each value is divided into a stem and a leaf. The leaves for each stem are 
shown separately. The stem-and-leaf diagram preserves the information on individual observations. 


EXAMPLE 2.13 The following are the California Achievement Percentile Scores (CAT scores) for 30 
seventh-grade students: 


50 65 70 35 40 57 66 65 70 35 
29 34 44 56 66 60 44 50 58 46 
67 78 79 47 35 36 44 57 60 57 


A stem-and-leaf diagram for these CAT scores is shown in Fig. 2-8. 


Stem | Leaves 
2    | 9 
3    | 35556 
4    | 044467 
5    | 0067778 
6    | 0055667 
7    | 0089 


Fig. 2-8 
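
A stem-and-leaf display for two-digit values can be sketched by splitting each value into its tens digit (stem) and units digit (leaf). The function name stem_and_leaf and the short list of scores below are hypothetical and are used only to illustrate the idea.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for non-negative integers, using the last digit as the leaf."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 57 -> stem 5, leaf 7
        display[stem].append(leaf)
    return dict(display)

scores = [57, 29, 44, 35, 60, 44, 36, 50, 67, 35]   # made-up scores

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Because every leaf is kept, the original observations can be read back from the display, which a grouped frequency distribution does not allow.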


Solved Problems 


RAW DATA 


2.1. Classify the following data as either qualitative data or quantitative data. In addition, classify the 
quantitative data as discrete or continuous. 


(a) The number of times that a movement authority is sent to a train from a relay station is 
recorded for several trains over a two-week period. The movement authority, which is an 
electronic transmission, is sent repeatedly until a return signal is received from the train. 

(b) A physician records the follow-up condition of patients with optic neuritis as improved, 
unchanged, or worse. 

(c) A quality technician records the length of material in a roll product for several products 
selected from a production line. 

(d) The Bureau of Justice Statistics Sourcebook of Criminal Justice Statistics in reporting on the 
daily use within the last 30 days of drugs among young adults lists the type of drug as 
marijuana, cocaine, or stimulants (adjusted). 

(e) The number of aces in five-card poker hands is noted by a gambler over several weeks of 
gambling at a casino. 


Ans. (a) The number of times that the movement authority must be sent is countable and therefore these data 
are quantitative and discrete. 

(b) These data are categorical or qualitative. 

(c) The length of material can be any number within an interval of values, and therefore these data 
are quantitative and continuous. 

(d) These data are categorical or qualitative. 

(e) Each value in the data set would be one of the five numbers 0, 1, 2, 3, or 4. These data are 
quantitative and discrete. 


FREQUENCY DISTRIBUTION FOR QUALITATIVE DATA 


2.2 The following list gives the academic ranks of the 25 female faculty members at a small liberal 
arts college: 


instructor 
associate professor 
full professor 

full professor 
assistant professor 
full professor 


assistant professor 
assistant professor 
associate professor 
associate professor 
assistant professor 
assistant professor 


assistant professor 
associate professor 
instructor 

assistant professor 
associate professor 
assistant professor 


instructor 
assistant professor 
assistant professor 
instructor 
assistant professor 
assistant professor 


associate professor 
Give a frequency distribution for these data. 


Ans. The academic ranks are tallied into the four possible categories and the results are shown in Table 
2.13. 


Table 2.13 

Academic rank          Frequency 
Full professor         3 
Associate professor    6 
Assistant professor    12 
Instructor             4 


RELATIVE FREQUENCY OF A CATEGORY AND PERCENTAGE 


2.3 Give the relative frequencies and percentages for the categories shown in Table 2.13. 


Ans. Each frequency in Table 2.13 is divided by 25 to obtain the relative frequencies for the categories. 
The relative frequencies are then multiplied by 100 to obtain percentages. The results are shown in 
Table 2.14. 


Table 2.14 

Academic rank          Relative frequency    Percentage 
Full professor         .12                   12% 
Associate professor    .24                   24% 
Assistant professor    .48                   48% 
Instructor             .16                   16% 

2.4 Refer to Table 2.14 to answer the following. 


(a) What percent of the female faculty have a rank of associate professor or higher? 
(b) What percent of the female faculty are not full professors? 
(c) What percent of the female faculty are assistant or associate professors? 


Ans. (a) 24% + 12% = 36% (b) 16% + 48% + 24% = 88% (c) 48% + 24% = 72% 


BAR GRAPHS AND PIE CHARTS 


2.5 The subjects in an eating disorders research study were divided into one of three different 
groups. Table 2.15 gives the frequency distribution for these three groups. Construct a bar graph. 


Table 2.15 


Group Frequency 


Bulimic 


Anorexic 
Control 


Ans. The bar graph for the distribution given in Table 2.15 is shown in Fig. 2-9. 


[Horizontal bar graph: group (Bulimic, Anorexic, Control) on the vertical axis; frequency, marked from 0 to 50, on the horizontal axis] 

Fig. 2-9 


2.6 Construct a pie chart for the frequency distribution given in Table 2.15. 


Ans. Table 2.16 illustrates the determination of the angles for each sector of the pie chart. 


Table 2.16 

Group       Relative frequency    Angle size 
Bulimic     .3                    360 x .3 = 108° 
Anorexic    .5                    360 x .5 = 180° 
Control     .2                    360 x .2 = 72° 


The pie chart for the distribution given in Table 2.15 is shown in Fig. 2-10. 


[Pie chart for eating disorder subjects: Bulimic (30.0%), Anorexic (50.0%), Control (20.0%)] 

Fig. 2-10 


2.7 A survey of 500 randomly chosen individuals is conducted. The individuals are asked to name 
their favorite sport. The pie chart in Fig. 2-11 summarizes the results of this survey. 


[Pie chart for favorite sport: Baseball (30.0%), Football (25.0%), Basketball (20.0%), Golf (10.0%), Hockey (10.0%), Other (5.0%)] 

Fig. 2-11 


(a) How many individuals in the 500 gave baseball as their favorite sport? 
(b) How many gave a sport other than basketball as their favorite sport? 
(c) How many gave hockey or golf as their favorite sport? 


Ans. (a) .3 x 500 = 150 (b) .8 x 500 = 400 (c) .2 x 500 = 100 


FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA: 
CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH 


2.8 Table 2.17 gives the frequency distribution for the cholesterol values of 45 patients in a cardiac 
rehabilitation study. Give the lower and upper class limits and boundaries as well as the class 
marks for each class. 


Table 2.17 
Cholesterol value 
170 to 189 


190 to 209 
210 to 229 
230 to 249 
250 to 269 


Ans. Table 2.18 gives the limits, boundaries, and marks for the classes in Table 2.17. 


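Table 2.18 

Class         Lower limit    Upper limit    Lower boundary    Upper boundary    Class mark 
170 to 189    170            189            169.5             189.5             179.5 
190 to 209    190            209            189.5             209.5             199.5 
210 to 229    210            229            209.5             229.5             219.5 
230 to 249    230            249            229.5             249.5             239.5 
250 to 269    250            269            249.5             269.5             259.5 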


2.9 The following data set gives the yearly food stamp expenditure in thousands of dollars for 25 
households in Alcor County: 


2.3 1.9 lt 3.2 pe [5 0.7 Z5 2 3.1 2.5 
2.0 od 1.9 Die 1.2 1:3 1.7 2.9 3.0 3.2 1.7 
2.2 251 2.0 


Construct a frequency distribution consisting of six classes for this data set. Use 0.5 as the lower 
limit for the first class and use a class width equal to 0.5. 


Ans. The first class would extend from 0.5 to 0.9 since the desired lower limit is 0.5 and the desired class 
width is 0.5. Note that the class boundaries are 0.45 and 0.95 and therefore the class width equals 
0.95 - 0.45 or 0.5. The frequency distribution is shown in Table 2.19. 


Table 2.19 


0.5 to 0.9 
1.0 to 1.4 
1.5 to 1.9 
2.0 to 2.4 
2.5 to 2.9 
3.0 to 3.4 


2.10 Express the frequency distribution given in Table 2.19 using the “less than” form for the classes. 


Ans. The answer is shown in Table 2.20. 


Table 2.20 
0.5 to less than 1.0 
1.0 to less than 1.5 
1.5 to less than 2.0 
2.0 to less than 2.5 
2.5 to less than 3.0 
3.0 to less than 3.5 


2.11 The manager of a convenience store records the number of gallons of gasoline purchased for a 
sample of customers chosen over a one-week period. Table 2.21 lists the raw data. Construct a 
frequency distribution having five classes, each of width 4. Use 0.000 as the lower limit of the 
first class. 


Table 2.21 

12.357    19.900    17.500    12.000     8.000    16.000 
15.500    18.500    10.000    16.500     6.000    14.675 
... 
 1.000    14.400     7.500     6.650    17.890    19.500 


Ans. When the data are tallied into the five classes, the frequency distribution shown in Table 2.22 is 
obtained. 


Table 2.22 

Gallons             Frequency 
0.000 to 3.999      3 
4.000 to 7.999      8 
8.000 to 11.999     3 
12.000 to 15.999    20 
16.000 to 19.999    14 

2.12 Using the data given in Table 2.21, form a frequency distribution consisting of the classes 0 to 
less than 4, 4 to less than 8, 8 to less than 12, 12 to less than 16, and 16 to less than 20. Compare 
this frequency distribution with the one given in Table 2.22. 


Ans. The frequency distribution is given in Table 2.23. Table 2.23 may have more popular appeal than 
Table 2.22. 


Table 2.23 


0 to less than 4 
4 to less than 8 
8 to less than 12 
12 to less than 16 
16 to less than 20 


SINGLE-VALUED CLASSES 


2.13 The Food Guide Pyramid divides food into the following six groups: Fats, Oils, and Sweets 
Group; Milk, Yogurt, and Cheese Group; Vegetable Group; Bread, Cereal, Rice, and Pasta 
Group; Meat, Poultry, Fish, Dry Beans, Eggs, and Nuts Group; Fruit Group. One question in a 
nutrition study asked the individuals in the study to give the number of groups included in their 
daily meals. The results are given below: 


6 4 5 4 4 3 4 5 5 5 
6 5 4 3 6 6 6 5 2 3 
4 5 6 4 5 5 5 6 5 6 
5 


Give a frequency distribution for these data. 


Ans. A frequency distribution with single-valued classes is appropriate since only five unique values 
occur. The frequency distribution is shown in Table 2.24. 


Table 2.24 

Number of food groups    Frequency 
2                        1 
3                        3 
4                        7 
5                        12 
6                        8 

2.14 A sociological study involving Mexican-American women utilized a 50-question survey. One 
question concerned the number of children living at home. The data for this question are given 
below: 

5 2 3 0 4 6 2 1 1 2 
3 3 4 4 5 5 5 3 3 3 
3 4 4 4 5 5 2 3 4 4 
5 


Give a frequency distribution for these data. 


Ans. Since only a small number of unique values occur, the classes will be chosen to be single valued. The 
frequency distribution is shown in Table 2.25. 


Table 2.25 


Number of children    Frequency
0                        1
1                        2
2                        4
3                        8
4                        8
5                        7
6                        1


HISTOGRAMS

2.15 Construct a histogram for the frequency distribution shown in Table 2.23.

Ans. The histogram for the frequency distribution in Table 2.23 is shown in Fig. 2-12.

Fig. 2-12

2.16 Construct a histogram for the frequency distribution given in Table 2.24.


Ans. The histogram for the frequency distribution in Table 2.24 is shown in Fig. 2-13. 


Fig. 2-13 (histogram; horizontal axis: Number of food groups; vertical axis: Frequency)


CUMULATIVE FREQUENCY DISTRIBUTIONS 


2.17 The Beckmann-Beal mathematics competency test is administered to 150 high school students 
for an educational study. The test consists of 48 questions and the frequency distribution for the 
scores is given in Table 2.26. Construct a cumulative frequency distribution for the scores. 


Table 2.26 


Beckmann-Beal score 


5 
15 
20 
30 
50 
30 


Ans. The cumulative frequency distribution is shown in Table 2.27.

Table 2.27

Scores less than    Cumulative frequency
0 


2.18 Table 2.28 gives the cumulative frequency distribution of reading readiness scores for 35 
kindergarten pupils. 


Table 2.28 


(a) How many of the pupils scored 80 or higher? 
(b) How many of the pupils scored 60 or higher but lower than 80? 
(c) How many of the pupils scored 50 or higher? 
(d) How many of the pupils scored 90 or lower? 


Ans. (a) 35-30=5 (b) 30-5=25 (c) 35 (d) 35 


CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS 


2.19 Give the cumulative relative frequencies and the cumulative percentages for the reading 
readiness scores in Table 2.28. 


Ans. The cumulative relative frequencies and cumulative percentages are shown in Table 2.29. 


Table 2.29

Scores less than    Cumulative relative frequency    Cumulative percentage


OGIVES 


2.20 Table 2.30 gives the cumulative frequency distribution for the daily breast-milk production in
grams for 25 nursing mothers in a research study. Construct an ogive for this distribution.

Table 2.30

Daily production less than    Cumulative frequency
500                               0
550                               3
600                              11
650                              20
700                              22
750                              25


Ans. The ogive curve for this distribution is shown in Fig. 2-14. 


Fig. 2-14 (ogive; horizontal axis: Grams, 500 to 750)


2.21 Construct an ogive curve for the cumulative relative frequency distribution that corresponds to 
the cumulative frequency distribution in Table 2.30. 


Ans. Each of the cumulative frequencies in Table 2.30 is divided by 25 and the cumulative relative 
frequencies 0, .12, .44, .80, .88, and 1.00 are determined. Using these, the cumulative relative 
frequency distribution shown in Fig. 2-15 can be constructed. 


Fig. 2-15 (ogive; vertical axis: Cumulative relative frequency)


2.22 Construct an ogive curve for the cumulative percentage distribution that corresponds to the
cumulative frequency distribution in Table 2.30.

Ans. The cumulative percentages are obtained by multiplying the cumulative relative frequencies given in
Problem 2.21 by 100 to obtain 0%, 12%, 44%, 80%, 88%, and 100%. These percentages are then
used to construct the ogive shown in Fig. 2-16.


Fig. 2-16 (ogive; horizontal axis: Grams, 500 to 750; vertical axis: 0 to 100)


STEM-AND-LEAF DISPLAYS 


2.23 The mathematical competency scores of 30 junior high students participating in an educational 
study are as follows: 


30 35 28 44 33 22 40 38 37 36
28 29 30 30 40 30 34 37 38 40 
38 34 40 37 23 26 30 45 29 40 


Construct a stem-and-leaf display for these data. Use 2, 3, and 4 as your stems. 


Ans. The stem-and-leaf display is shown in Fig. 2-17. 


Stem | Leaves
2 | 2368899
3 | 0000034456777888
4 | 0000045

Fig. 2-17
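The construction in Problem 2.23 can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` is an assumption of this sketch, it handles two-digit whole numbers only, and the tenth score is taken to be 36, the value consistent with the leaves of Fig. 2-17.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for two-digit whole numbers."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit is the stem, units digit the leaf
        rows[stem].append(leaf)
    return dict(rows)

# Scores from Problem 2.23 (tenth score read as 36, an assumption of this sketch).
scores = [30, 35, 28, 44, 33, 22, 40, 38, 37, 36,
          28, 29, 30, 30, 40, 30, 34, 37, 38, 40,
          38, 34, 40, 37, 23, 26, 30, 45, 29, 40]

display = stem_and_leaf(scores)
for stem in sorted(display):
    print(stem, "|", "".join(str(leaf) for leaf in display[stem]))
```

Sorting the values first means the leaves on each stem come out already ordered, which is the usual convention for a finished display.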


2.24 Refine the display shown in Fig. 2-17 by separating the leaves into two groups, one consisting of 
0, 1, 2, 3, and 4 and the other consisting of 5, 6, 7, 8, and 9. 


Ans. The solution is shown in Fig. 2-18.

Stem | Leaves
2 | 23
2 | 68899
3 | 00000344
3 | 56777888
4 | 000004
4 | 5

Fig. 2-18

2.25 The stem-and-leaf display shown in Fig. 2-19 gives the savings in 15 randomly selected
accounts. If the total amount in these 15 accounts equals $4340, find the value of the missing
leaf, x.

Stem | Leaves
1 | 05 50 65 90
2 | 10 20 55 x 75
3 | 25 50 70
4 | 15 80
5 | 65

Fig. 2-19


Ans. Let A be the amount in the account with the missing leaf. Then the following equation must hold.
(105 + 150 + 165 + 190 + 210 + 220 + 255 + A + 275 + 325 + 350 + 370 + 415 + 480 + 565) = 4340

The known amounts total 4075, so the solution to this equation is A = 4340 - 4075 = 265, which implies that x = 65.


Supplementary Problems 


RAW DATA 


2.26 Classify the data described in the following scenarios as qualitative or quantitative. Classify the quantitative 
data as either discrete or continuous. 


(a) The individuals in a sociological study are classified into one of five income classes as follows: low, 
low to middle, middle, middle to upper, or upper. 

(b) The fasting blood sugar readings are determined for several individuals in a study involving diabetics. 

(c) The number of questions correctly answered on a 25-item test is recorded for each student in a 
computer science class. 

(d) The number of attempts needed before successfully finding the path through a maze that leads to a 
reward is recorded for several rats in a psychological study. 

(e) The race of each inmate is recorded for the individuals in a criminal justice study. 


Ans. (a) qualitative (b) quantitative, continuous (c) quantitative, discrete 
(d) quantitative, discrete (e) qualitative 


FREQUENCY DISTRIBUTION FOR QUALITATIVE DATA 


2.27 The following responses were obtained when 50 randomly selected residents of a small city were asked the
question “How safe do you think your neighborhood is for kids?”


very very not sure not at all very not very not sure 
very not sure somewhat not very very not at all not very 
very very very very not very somewhat somewhat 
very very not sure not very not at all not very very 

very not sure very very not very very very 

not very somewhat somewhat very somewhat very very 

not very not at all very very very somewhat very 
somewhat 


Give a frequency distribution for these data.

Ans. See Table 2.31.

Table 2.31

Response      Frequency


Very 24 
Somewhat 8 
Not very 9 
Not at all 4 
Not sure 5 


RELATIVE FREQUENCY OF A CATEGORY AND PERCENTAGE 
2.28 Give the relative frequencies and percentages for the categories shown in Table 2.31. 


Ans. See Table 2.32. 


Table 2.32

Response      Relative frequency    Percentage
Very              .48                  48%
Somewhat          .16                  16%
Not very          .18                  18%
Not at all        .08                   8%
Not sure          .10                  10%


2.29 Refer to Table 2.32 to answer the following questions.
(a) What percent of the respondents have no opinion, i.e., responded not sure, on how safe the neighborhood is for children?
(b) What percent of the respondents think the neighborhood is very or somewhat safe for children?
(c) What percent of the respondents give a response other than very safe?

Ans. (a) 10% (b) 64% (c) 52%
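The relative frequencies and percentages asked for in Problems 2.28 and 2.29 follow directly from the frequencies in Table 2.31. The sketch below is one way to check them; the variable names are assumptions of this illustration.

```python
# Frequencies from Table 2.31.
frequency = {"Very": 24, "Somewhat": 8, "Not very": 9, "Not at all": 4, "Not sure": 5}

n = sum(frequency.values())  # total number of respondents
relative = {k: f / n for k, f in frequency.items()}        # relative frequency = f / n
percentage = {k: 100 * r for k, r in relative.items()}     # percentage = 100 * relative frequency

for k in frequency:
    print(f"{k:<12}{relative[k]:>6.2f}{percentage[k]:>7.0f}%")

very_or_somewhat = percentage["Very"] + percentage["Somewhat"]  # Problem 2.29(b)
other_than_very = 100 - percentage["Very"]                      # Problem 2.29(c)
```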


BAR GRAPHS AND PIE CHARTS

2.30 Construct a bar graph for the frequency distribution in Table 2.31.

2.31 Construct a pie chart for the frequency distribution given in Table 2.31.

2.32 The bar graph given in Fig. 2-20 shows the distribution of responses of 300 individuals to the question “How
do you prefer to spend stressful times?”
(a) What percent preferred to spend time alone?
(b) What percent gave a response other than “with friends”?
(c) How many individuals responded “with family,” “with friends,” or “other”?

Ans. (a) 50% (b) 83.3% (c) 150

Fig. 2-20


FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA: 
CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH 


2.33 Table 2.33 gives the distribution of response times in minutes for 911 emergency calls classified as
domestic disturbance calls. Give the lower and upper class limits and boundaries as well as the class marks
for each class. What is the class width for the distribution?

Table 2.33

Response time    Frequency


Ans. See Table 2.34. 


Table 2.34 


The class width equals 5 minutes. 
2.34 Express the distribution given in Table 2.33 in the “less than” form. 


Ans. See Table 2.35.

Table 2.35

Response time          Frequency
5 to less than 10          3
10 to less than 15         7
15 to less than 20        25
20 to less than 25        19
25 to less than 30        14
30 to less than 35         2


2.35 Table 2.36 gives the response times in minutes for 50 randomly selected 911 emergency calls classified as 
robbery in progress calls. Group the data into five classes, using 1.0 to 2.9 as your first class. 


Ans. See Table 2.37.

Table 2.37

Response time    Frequency
1.0 to 2.9           7
3.0 to 4.9           6
5.0 to 6.9          16
7.0 to 8.9          14
9.0 to 10.9          7


2.37 Refer to the frequency distribution of response times to 911 robbery in progress calls given in Table 2.37 to 
answer the following. 
(a) What percent of the response times are less than seven minutes? 
(b) What percent of the response times are equal to or greater than three minutes but less than seven 
minutes? 
(c) What percent of the response times are nine or more minutes in length? 


Ans. (a) 58% (b) 44% (c) 14% 


2.38 Refer to the frequency distribution given in Table 2.38 to find the following.
(a) The boundaries for the class c to d
(b) The class mark for the class e to f
(c) The width for the class g to h
(d) The lower class limit for the class g to h
(e) The total number of observations

Table 2.38

Class    Frequency

Ans. (a) (b + c)/2 and (d + e)/2 (b) (e + f)/2 (c) (i + h)/2 - (f + g)/2
(d) g (e) $f_1 + f_2 + f_3 + f_4 + f_5$


SINGLE-VALUED CLASSES 


2.39 A quality control technician records the number of defective items found in samples of size 50 for each of 
30 days. The data are as follows: 


0 2 3 0 0 0 0 1 2 1
1 2 0 0 1 2 2 2 0 0
1 1 1 0 0 0 3 2 0 1

Give a frequency distribution for these data.

Ans. See Table 2.39.

Table 2.39

Number of defectives    Frequency
0                          13
1                           8
2                           7
3                           2
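A frequency distribution with single-valued classes is just a tally of each distinct value. As a hedged illustration, the tally for Problem 2.39 can be reproduced with `collections.Counter` from the Python standard library; the list below is the 30 daily counts as read from the problem statement.

```python
from collections import Counter

# Daily counts of defective items from Problem 2.39 (30 days).
defectives = [0, 2, 3, 0, 0, 0, 0, 1, 2, 1,
              1, 2, 0, 0, 1, 2, 2, 2, 0, 0,
              1, 1, 1, 0, 0, 0, 3, 2, 0, 1]

table = Counter(defectives)  # maps each distinct value to its frequency
for value in sorted(table):
    print(value, table[value])
```

The printed rows should agree with Table 2.39, and the frequencies should add up to the number of days sampled.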


2.40 The number of daily traffic citations issued over a 100-mile section of Interstate 80 is recorded for each day 
of September. The frequency distribution for these data is shown in Table 2.40. Find the value for x. 


Table 2.40 


5 
7 
10 
x
3 
3 


HISTOGRAMS

2.41 Construct a histogram for the response times frequency distribution given in Table 2.37.

2.42 Construct a histogram for the number of defectives frequency distribution given in Table 2.39.


CUMULATIVE FREQUENCY DISTRIBUTIONS 


2.43 Give the cumulative frequency distribution for the frequency distribution shown in Table 2.37.

Ans. See Table 2.41.

Table 2.41

Response time less than    Cumulative frequency
1.0                            0
3.0                            7
5.0                           13
7.0                           29
9.0                           43
11.0                          50

2.44 Refer to Table 2.41 to answer the following questions.
(a) How many of the response times are less than 5.0 minutes? 
(b) How many of the response times are 7.0 or more minutes? 
(c) How many of the response times are equal to or greater than 5.0 minutes but less than 9.0 minutes? 


Ans. (a) 13 (b) 21 (c) 30 
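The cumulative frequencies behind Problems 2.43 and 2.44 are running totals of the class frequencies in Table 2.37. A minimal sketch, assuming the "less than" limits 1.0, 3.0, ..., 11.0 (the value 11.0 is introduced here only to close the last class):

```python
from itertools import accumulate

# Table 2.37: classes 1.0-2.9, 3.0-4.9, 5.0-6.9, 7.0-8.9, 9.0-10.9
lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]  # 11.0 closes the last class (assumption)
frequencies = [7, 6, 16, 14, 7]

# Cumulative frequency "less than" each limit, starting from 0.
cumulative = [0] + list(accumulate(frequencies))
n = cumulative[-1]
less_than = dict(zip(lower_limits, cumulative))

for limit, c in less_than.items():
    print(f"less than {limit:4.1f}: {c:2d}  {c / n:.2f}  {100 * c / n:.0f}%")

answer_a = less_than[5.0]                    # fewer than 5.0 minutes
answer_b = n - less_than[7.0]                # 7.0 minutes or more
answer_c = less_than[9.0] - less_than[5.0]   # at least 5.0 but under 9.0 minutes
```

The same loop also prints the cumulative relative frequencies and cumulative percentages requested in Problem 2.45.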


CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS 


2.45 Give the cumulative relative frequencies and the cumulative percentages for the cumulative frequency 
distribution shown in Table 2.41. 


Ans. See Table 2.42. 


Table 2.42

Response time less than    Cumulative relative frequency    Cumulative percentage
1.0                              0                                0%
3.0                             .14                              14%
5.0                             .26                              26%
7.0                             .58                              58%
9.0                             .86                              86%
11.0                           1.00                             100%

OGIVES 
2.46 Construct an ogive for the cumulative frequency distribution given in Table 2.41. 


2.47 Construct an ogive for the cumulative relative frequency distribution given in Table 2.42. 


STEM-AND-LEAF DISPLAYS 


2.48 The number of calls per 24-hour period to a 911 emergency number is recorded for 50 such periods. The 
results are shown in Table 2.43. Construct a stem-and-leaf display for these data. 


Table 2.43 


Ans. See Fig. 2-21.

Fig. 2-21

2.49 The number of syringes used per month by the patients of a diabetic specialist is recorded and the results
are given in Fig. 2-22. Answer the following questions by referring to Fig. 2-22.
(a) What is the minimum number of syringes used per month by these patients?
(b) What is the maximum number of syringes used per month by these patients?
(c) What is the total usage per month by these patients?
(d) What usage occurs most frequently?

Stem | Leaves
3 | 005555
4 | 0035688
5 | 244488999
6 | 00000055555888
7 | 33355
8 | 005
9 | 00

Fig. 2-22

Ans. (a) 30 (b) 90 (c) 2700 (d) 60

2.50 Refine the stem-and-leaf display in Fig. 2-22 by using either the leaves 0, 1, 2, 3, or 4 or the leaves 5, 6, 7,
8, or 9 on a particular row.

Ans. See Fig. 2-23.

Stem | Leaves
3 | 00
3 | 5555
4 | 003
4 | 5688
5 | 2444
5 | 88999
6 | 000000
6 | 55555888
7 | 333
7 | 55
8 | 00
8 | 5
9 | 00

Fig. 2-23

Chapter 3


Descriptive Measures 


MEASURES OF CENTRAL TENDENCY 


Chapter 2 gives several techniques for organizing data. Bar graphs, pie charts, frequency 
distributions, histograms, and stem-and-leaf plots are techniques for describing data. Often, we 
are interested in a typical numerical value to help us describe a data set. This typical value is often 
called an average value or a measure of central tendency. We are looking for a single number that is 
in some sense representative of the complete data set. 


EXAMPLE 3.1 The following are examples of measures of central tendency: median priced home, average 
cost of a new automobile, the average household income in the United States, and the modal number of 
televisions per household. Each of these examples is a single number, which is intended to be typical of the 
characteristic of interest. 


MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA 


A data set consisting of the observations for some variable is referred to as raw data or 
ungrouped data. Data presented in the form of a frequency distribution are called grouped data. The 
measures of central tendency discussed in this chapter will be described for both grouped and 
ungrouped data since both forms of data occur frequently. 

There are many different measures of central tendency. The three most widely used measures of 
central tendency are the mean, median, and mode. These measures are defined for both samples and 
populations. 

The mean for a sample consisting of n observations is

$$\bar{x} = \frac{\sum x}{n} \tag{3.1}$$

and the mean for a population consisting of N observations is

$$\mu = \frac{\sum x}{N} \tag{3.2}$$


EXAMPLE 3.2 The number of 911 emergency calls classified as domestic disturbance calls in a large metropolitan
location was sampled for thirty randomly selected 24-hour periods with the following results. Find the
mean number of calls per 24-hour period.

25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50

The thirty observations sum to 1168, so by formula (3.1) the mean is

$$\bar{x} = \frac{\sum x}{n} = \frac{1168}{30} = 38.9$$

EXAMPLE 3.3 The total number of 911 emergency calls classified as domestic disturbance calls last year in a
large metropolitan location was 14,950. Find the mean number of such calls per 24-hour period if last year was
not a leap year.

Since last year had 365 days, formula (3.2) gives

$$\mu = \frac{\sum x}{N} = \frac{14{,}950}{365} = 41.0$$


The median of a set of data is a value that divides the bottom 50% of the data from the top 50%
of the data. To find the median of a data set, first arrange the data in increasing order. If the number
of observations is odd, the median is the number in the middle of the ordered list. If the number of
observations is even, the median is the mean of the two values closest to the middle of the ordered
list. There is no widely used symbol for the median. Occasionally, the symbol $\tilde{x}$ is
used to represent the sample median and the symbol $\tilde{\mu}$ is used to represent the population median.


EXAMPLE 3.4 To find the median number of domestic disturbance calls per 24-hour period for the data in
Example 3.2, first arrange the data in increasing order.


19 22 23 25 27 29 29 30 30 34 36 37 37 38 38 
40 40 40 44 45 46 46 47 47 49 50 50 56 56 58 


The two values closest to the middle are 38 and 40. The median is the mean of these two values or 39. 


EXAMPLE 3.5 A bank auditor selects 11 checking accounts and records the amount in each of the accounts. 
The 11 observations in increasing order are as follows: 


150.25 175.35 195.00 200.00 235.00 240.45 250.55 256.00 275.50 290.10 300.55 


The median is 240.45 since this is the middle value in the ordered list. 


The mode is the value in a data set that occurs the most often. If no such value exists, we say that 
the data set has no mode. If two such values exist, we say the data set is bimodal. If three such values 
exist, we say the data set is trimodal. There is no symbol that is used to represent the mode. 


EXAMPLE 3.6 Find the mode for the data given in Example 3.2. Often it is helpful to arrange the data in 
increasing order when finding the mode. The data, in increasing order, are given in Example 3.4. When the data 
are examined, it is seen that 40 occurs three times, and that no other value occurs that often. The mode is equal 
to 40. 
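The mean, median, and mode of the domestic disturbance data can be checked with Python's standard `statistics` module. This is a minimal sketch, assuming Python 3.8 or later (needed for `multimode`); the list name `calls` is chosen here for illustration.

```python
from statistics import mean, median, multimode

# The 30 observations from Example 3.2.
calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

print(round(mean(calls), 1))  # sum of the observations divided by n
print(median(calls))          # mean of the two middle values, since n is even
print(multimode(calls))       # list of the most frequent value(s)
```

`multimode` returns every value tied for the highest frequency, so a bimodal or trimodal data set shows up as a list of two or three values.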


For a large data set, as the number of classes is increased (and the width of the classes is 
decreased), the histogram becomes a smooth curve. Oftentimes, the smooth curve assumes a shape 
like that shown in Fig. 3-1. In this case, the data set is said to have a bell-shaped distribution or a 
mound-shaped distribution. For such a distribution, the mean, median, and mode are equal and they 
are located at the center of the curve. 


Fig. 3-1

For a data set having a skewed-to-the-right distribution, the mode is usually less than the median,
which is usually less than the mean. For a data set having a skewed-to-the-left distribution, the mean
is usually less than the median, which is usually less than the mode.


EXAMPLE 3.7 Find the mean, median, and mode for the following three data sets and confirm the above
paragraph.

Data set 1: 10, 12, 15, 15, 18, 20    Data set 2: 2, 4, 6, 15, 15, 18    Data set 3: 12, 15, 15, 24, 26, 28

Table 3.1 gives the shape of the distribution, the mean, the median, and the mode for the three data sets.

Table 3.1

Data set    Distribution shape    Mean    Median    Mode
1           Bell-shaped            15      15        15
2           Left-skewed            10      10.5      15
3           Right-skewed           20      19.5      15


MEASURES OF DISPERSION 


In addition to measures of central tendency, it is desirable to have numerical values to describe 
the spread or dispersion of a data set. Measures that describe the spread of a data set are called 
measures of dispersion. 


EXAMPLE 3.8 Jon and Jack are two golfers who both average 85. However, Jon has shot as low as 75 and as 
high as 99 whereas Jack has never shot below 80 nor higher than 90. When we say that Jack is a more consistent 
golfer than Jon is, we mean that the spread in Jack’s scores is less than the spread in Jon's scores. A measure of 
dispersion is a numerical value that illustrates the differences in the spread of their scores. 


RANGE, VARIANCE, AND STANDARD DEVIATION FOR UNGROUPED DATA 


The range for a data set is equal to the maximum value in the data set minus the minimum value 
in the data set. It is clear that the range is reflective of the spread in the data set since the difference 
between the largest and the smallest value is directly related to the spread in the data. 


EXAMPLE 3.9 Compare the range in golf scores for Jon and Jack in Example 3.8. The range for Jon is
99 - 75 = 24 and the range for Jack is 90 - 80 = 10. The spread in Jon's scores, as measured by the range, is over twice
the spread in Jack’s scores.


The variance and the standard deviation of a data set measure the spread of the data about the
mean of the data set. The variance of a sample of size n is represented by $s^2$ and is given by

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \tag{3.3}$$

and the variance of a population of size N is represented by $\sigma^2$ and is given by

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \tag{3.4}$$

The symbol $\sigma$ is the lowercase sigma of the Greek alphabet and $\sigma^2$ is read as sigma squared.

EXAMPLE 3.10 The times required in minutes for five preschoolers to complete a task were 5, 10, 15, 3, and
7. The mean time for the five preschoolers is 8 minutes. Table 3.2 illustrates the computation indicated by
formula (3.3). The first column lists the observations, x. The second column lists the deviations from the mean,
$x - \bar{x}$. The third column lists the squares of the deviations. The sum at the bottom of the second column is
called the sum of the deviations, and is always equal to zero for any data set. The sum at the bottom of the third
column is referred to as the sum of the squares of the deviations. The sample variance is obtained by dividing
the sum of the squares of the deviations by n - 1, or 5 - 1 = 4. The sample variance equals 88 divided by 4, which
is 22 minutes squared.

Table 3.2

x        x - x̄     (x - x̄)²
5         -3          9
10         2          4
15         7         49
3         -5         25
7         -1          1
Sum = 40   0         88


The variance of the data in Example 3.10 is 22 minutes squared. The units for the variance are
minutes squared since the terms which are added in column 3 are minutes squared. The square root
of the variance is called the standard deviation and the standard deviation is measured in the same
units as the variable. The standard deviation of the times to complete the task is $\sqrt{22}$ or 4.7 minutes.

The sample standard deviation is

$$s = \sqrt{s^2} \tag{3.5}$$

and the population standard deviation is

$$\sigma = \sqrt{\sigma^2} \tag{3.6}$$

Shortcut formulas equivalent to formulas (3.3) and (3.4) are useful in computing variances and
standard deviations. The shortcut formulas for computing sample and population variances are

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \tag{3.7}$$

and

$$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \tag{3.8}$$

EXAMPLE 3.11 Formula (3.7) can be used to find the variance and standard deviation of the times given in
Example 3.10. The term $\sum x^2$ is called the sum of the squares and is found as follows:

$$\sum x^2 = 5^2 + 10^2 + 15^2 + 3^2 + 7^2 = 408$$

The term $(\sum x)^2$ is referred to as the sum squared and is found as follows:

$$(\sum x)^2 = (5 + 10 + 15 + 3 + 7)^2 = 40^2 = 1600$$

The variance is given as follows:

$$s^2 = \frac{408 - \frac{1600}{5}}{5 - 1} = \frac{88}{4} = 22$$

and the standard deviation is

$$s = \sqrt{22} = 4.7$$

These are the same values we found in Example 3.10 for the variance and standard deviation of the times.
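The agreement between the definitional formula (3.3) and the shortcut formula (3.7) can be checked numerically. The sketch below is an illustration only; the variable names are assumptions, and `statistics.variance` from the standard library is used as an independent check because it also divides by n - 1.

```python
from math import sqrt, isclose
from statistics import variance

times = [5, 10, 15, 3, 7]  # minutes, from Example 3.10
n = len(times)
x_bar = sum(times) / n

# Definitional form (3.3): sum of squared deviations over n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in times) / (n - 1)

# Shortcut form (3.7): needs only the sum of squares and the squared sum.
sum_sq = sum(x * x for x in times)   # sum of the squares
sq_sum = sum(times) ** 2             # sum squared
s2_shortcut = (sum_sq - sq_sum / n) / (n - 1)

s = sqrt(s2_shortcut)
print(s2_definition, s2_shortcut, round(s, 1))
```

The shortcut form avoids computing each deviation, which mattered for hand calculation; with software, either form gives the same result for data of this size.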


Since most populations are large, the computation of $\sigma^2$ is rarely performed. In practice, the
population variance (or standard deviation) is usually estimated by taking a sample from the
population and using $s^2$ as an estimate of $\sigma^2$. The use of n - 1 rather than n in the denominator of the
formula for $s^2$ enhances the ability of $s^2$ to estimate $\sigma^2$.

For data sets having a symmetric mound-shaped distribution, the standard deviation is 
approximately equal to one-fourth of the range of the data set. This fact can be used to estimate s for 
bell-shaped distributions. 

Statistical software packages are frequently used to compute the standard deviation as well as 
many other statistical measures. 


EXAMPLE 3.12 The costs for scientific calculators with comparable built-in functions were recorded for 20
different sales locations. The results were as follows:

10.50 12.75 11.00 16.50 19.30 20.00 16.50 13.90 17.50 18.00
13.50 17.75 18.50 20.00 15.00 14.45 17.85 15.00 17.50 13.50

The analysis of these data using Minitab is as follows.

MTB > name c1 'cost'
MTB > set c1
DATA > 10.50 13.50 12.75 17.75 11.00 18.50 16.50 20.00 19.30
DATA > 15.00 20.00 14.45 16.50 17.85 13.90 15.00 17.50 17.50
DATA > 18.00 13.50
DATA > end
MTB > describe c1

Descriptive Statistics

Variable     N      Mean    Median    TrMean    StDev    SEMean
cost        20    15.950    16.500    16.028    2.826     0.632

Variable       Min       Max        Q1        Q3
cost        10.500    20.000    13.600    17.962

The cost data are set into column c1, and the command describe c1 produces 10 different descriptive measures.
The student is encouraged to confirm the values for the mean, median, standard deviation, minimum, and
maximum. The other four measures are described elsewhere in the outline. Even though these data are not
symmetric mound shaped, a “ballpark” approximation to the standard deviation is obtained by dividing the
range by 4. One-fourth of the range is 2.375 and the value of the standard deviation is 2.826.
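Readers without Minitab can confirm the mean, median, standard deviation, minimum, and maximum with Python's standard `statistics` module. This sketch is not a reproduction of the describe command: it covers only those five measures plus N, and the dictionary layout is an assumption of this illustration.

```python
from statistics import mean, median, stdev

# Calculator costs in the order they were entered in the Minitab session.
cost = [10.50, 13.50, 12.75, 17.75, 11.00, 18.50, 16.50, 20.00, 19.30,
        15.00, 20.00, 14.45, 16.50, 17.85, 13.90, 15.00, 17.50, 17.50,
        18.00, 13.50]

summary = {
    "N": len(cost),
    "Mean": mean(cost),
    "Median": median(cost),
    "StDev": stdev(cost),   # sample standard deviation (divides by n - 1)
    "Min": min(cost),
    "Max": max(cost),
}
range_over_4 = (max(cost) - min(cost)) / 4  # rough estimate of s for mound-shaped data

for name, value in summary.items():
    print(f"{name:<7}{value:.3f}")
print(f"Range/4 {range_over_4:.3f}")
```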


MEASURES OF CENTRAL TENDENCY AND DISPERSION 
FOR GROUPED DATA 


Statistical data are often given in grouped form, i.e., in the form of a frequency distribution, and
the raw data corresponding to the grouped data are not available or may be difficult to obtain. The
articles that appear in newspapers and professional journals do not give the raw data, but give the
results in grouped form. Table 3.3 gives the frequency distribution of the ages of 5000 shoplifters in
a recent psychological study of these individuals.

Table 3.3

Age        Frequency
5-14           750
15-24         2005
25-34         1950
35-44          195
45-54          100


The techniques for finding the mean, median, mode, range, variance and standard deviation of 
grouped data will be illustrated by finding these measures for the data given in Table 3.3. The 
formulas and techniques given will be for sample data. Similar formulas and techniques are used for 
population data. 

The mean for grouped data is given by

$$\bar{x} = \frac{\sum xf}{n} \tag{3.9}$$

where x represents the class marks, f represents the class frequencies, and $n = \sum f$.


EXAMPLE 3.13 The class marks in Table 3.3 are $x_1 = 9.5$, $x_2 = 19.5$, $x_3 = 29.5$, $x_4 = 39.5$, $x_5 = 49.5$ and the
frequencies are $f_1 = 750$, $f_2 = 2005$, $f_3 = 1950$, $f_4 = 195$, and $f_5 = 100$. The sample size n is 5000. The mean is

$$\bar{x} = \frac{9.5 \times 750 + 19.5 \times 2005 + 29.5 \times 1950 + 39.5 \times 195 + 49.5 \times 100}{5000} = \frac{116{,}400}{5000} = 23.3 \text{ years}$$


The median for grouped data is found by locating the value that divides the data into two equal 
parts. In finding the median for grouped data, it is assumed that the data in each class is uniformly 
spread across the class. 


EXAMPLE 3.14 The median age for the data in Table 3.3 is a value such that 2500 ages are less than the
value and 2500 are greater than the value. The median age must occur in the age group 15-24, since 750 are less
than 15 and 2755 are 24 years or less. The class 15-24 is called the median class since the median must fall in
this class. Since 750 are less than 15 years, there must be 1750 additional ages in the class 15-24 that are less
than the median. In other words, we need to go the fraction 1750/2005 across the class 15-24 to locate the
median. We give the value 14.5 + (1750/2005) × 10 = 23.2 years as the median age. To summarize, 14.5 is the
lower boundary of the median class, 1750/2005 is the fraction we must go across the median class to reach the
median, and 10 is the class width for the median class.
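The grouped mean of formula (3.9) and the interpolation used for the grouped median can be sketched as follows. The function name `grouped_median` and the tuple layout are assumptions of this illustration, and the boundaries are taken as the class limits plus or minus 0.5, which holds only for whole-number class limits such as the ages used in Examples 3.13 and 3.14.

```python
# Age classes and frequencies as used in Examples 3.13 and 3.14:
# (lower limit, upper limit, frequency)
classes = [(5, 14, 750), (15, 24, 2005), (25, 34, 1950), (35, 44, 195), (45, 54, 100)]

n = sum(f for _, _, f in classes)

# Grouped mean (3.9): class marks weighted by class frequencies.
grouped_mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n

def grouped_median(classes):
    """Interpolate within the median class, assuming values spread uniformly."""
    total = sum(f for _, _, f in classes)
    half = total / 2
    below = 0  # observations in the classes before the current one
    for lo, hi, f in classes:
        if below + f >= half:
            lower_boundary = lo - 0.5            # whole-number class limits assumed
            width = (hi + 0.5) - lower_boundary
            return lower_boundary + (half - below) / f * width
        below += f
    raise ValueError("no observations")

print(round(grouped_mean, 1), round(grouped_median(classes), 1))
```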


The modal class is defined to be the class with the maximum frequency. The mode for grouped 
data is defined to be the class mark of the modal class. 


EXAMPLE 3.15 The modal class for the distribution in Table 3.3 is the class 15-24. The mode is the class
mark of this class, which equals 19.5 years.


The range for grouped data is given by the difference between the upper boundary of the class
having the largest values and the lower boundary of the class having the smallest values.


EXAMPLE 3.16 The upper boundary for the class 45-54 is 54.5 and the lower boundary for the class 5-14 is
4.5, and the range is 54.5 - 4.5 = 50.0 years.

The variance for grouped data is given by

$$s^2 = \frac{\sum x^2 f - \frac{(\sum xf)^2}{n}}{n - 1} \tag{3.10}$$

and the standard deviation is given by

$$s = \sqrt{s^2} \tag{3.11}$$


EXAMPLE 3.17 In order to find the variance and standard deviation for the distribution in Table 3.3 using
formulas (3.10) and (3.11), we first evaluate $\sum x^2 f$ and $\frac{(\sum xf)^2}{n}$.

$$\sum x^2 f = 9.5^2 \times 750 + 19.5^2 \times 2005 + 29.5^2 \times 1950 + 39.5^2 \times 195 + 49.5^2 \times 100 = 3{,}076{,}350$$

From Example 3.13, we see that $\sum xf = 116{,}400$, and therefore $\frac{(\sum xf)^2}{n} = \frac{116{,}400^2}{5000} = 2{,}709{,}792$. The variance is

$$s^2 = \frac{3{,}076{,}350 - 2{,}709{,}792}{4999} = 73.3$$

and the standard deviation is $s = \sqrt{73.3} = 8.6$ years.
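The arithmetic in Example 3.17 is easy to get wrong by hand, so a short numerical check may help. The sketch below applies formulas (3.10) and (3.11) to the class marks and frequencies quoted in Example 3.13; the variable names are assumptions of this illustration.

```python
from math import sqrt

marks = [9.5, 19.5, 29.5, 39.5, 49.5]   # class marks x
freqs = [750, 2005, 1950, 195, 100]     # class frequencies f

n = sum(freqs)
sum_xf = sum(x * f for x, f in zip(marks, freqs))        # sum of x*f
sum_x2f = sum(x * x * f for x, f in zip(marks, freqs))   # sum of x^2*f

s2 = (sum_x2f - sum_xf ** 2 / n) / (n - 1)   # formula (3.10)
s = sqrt(s2)                                 # formula (3.11)

print(sum_xf, sum_x2f, round(s2, 1), round(s, 1))
```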


CHEBYSHEV’S THEOREM 


Chebyshev’s theorem provides a useful interpretation of the standard deviation. Chebyshev’s
theorem states that the fraction of any data set lying within k standard deviations of the mean is at
least $1 - \frac{1}{k^2}$, where k is a number greater than 1. The theorem applies to either a sample or a
population. If k = 2, this theorem states that at least $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4}$ or 75% of the data set will
fall between $\bar{x} - 2s$ and $\bar{x} + 2s$. Similarly, for k = 3, the theorem states that at least $\frac{8}{9}$ or 89% of the
data set will fall between $\bar{x} - 3s$ and $\bar{x} + 3s$.


EXAMPLE 3.18 The reading readiness scores for a group of 4 and 5 year old children have a mean of 73.5 
and a standard deviation equal to 5.5. At least 75% of the scores are between 73.5 — 2 x 5.5 = 62.5 and 73.5 + 2 
x 5.5 = 84.5. At least 89% of the scores are between 73.5 — 3 x 5.5 = 57.0 and 73.5 + 3 x 5.5 = 90.0. 
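The bound and the intervals in Example 3.18 can be computed for any k greater than 1. A minimal sketch; the function names `chebyshev_bound` and `interval` are assumptions of this illustration.

```python
def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Endpoints of the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

# Reading readiness scores from Example 3.18: mean 73.5, standard deviation 5.5.
for k in (2, 3):
    low, high = interval(73.5, 5.5, k)
    print(f"k={k}: at least {chebyshev_bound(k):.0%} between {low} and {high}")
```

Non-integer values such as k = 1.5 are also allowed, in which case the bound is 1 - 1/2.25, or about 56%.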


EMPIRICAL RULE 


The empirical rule states that for a data set having a bell-shaped distribution, approximately 68% 
of the observations lie within one standard deviation of the mean, approximately 95% of the 
observations lie within two standard deviations of the mean, and approximately 99.7% of the 
observations lie within three standard deviations of the mean. The empirical rule applies to either 
large samples or populations. 


EXAMPLE 3.19 Assuming the incomes for all single parent households last year had a bell-shaped distribution
with a mean equal to $23,500 and a standard deviation equal to $4,500, the following conclusions follow:
68% of the incomes lie between $19,000 and $28,000, 95% of the incomes lie between $14,500 and $32,500,
and 99.7% of the incomes lie between $10,000 and $37,000. If the shape of the distribution is unknown, then
Chebyshev’s theorem does not give us any information about the percent of the distribution between $19,000
and $28,000. However, Chebyshev’s theorem assures us that at least 75% of the incomes are between $14,500
and $32,500 and at least 89% of the incomes are between $10,000 and $37,000.
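The three intervals in Example 3.19 come from stepping one, two, and three standard deviations either side of the mean. The sketch below tabulates them; the names `EMPIRICAL` and `empirical_intervals` are assumptions of this illustration, and the percentages are the approximate ones quoted in the empirical rule, valid only for bell-shaped data.

```python
# Approximate shares within k standard deviations for bell-shaped data.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def empirical_intervals(mean, sd):
    """Map k to (lower endpoint, upper endpoint, approximate share)."""
    return {k: (mean - k * sd, mean + k * sd, share) for k, share in EMPIRICAL.items()}

incomes = empirical_intervals(23_500, 4_500)
for k, (low, high, share) in incomes.items():
    print(f"about {share:.1%} between ${low:,} and ${high:,}")
```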


COEFFICIENT OF VARIATION 


The coefficient of variation is equal to the standard deviation divided by the mean. The result is
usually multiplied by 100 to express it as a percent. The coefficient of variation for a sample is given
by

$$CV = \frac{s}{\bar{x}} \times 100\% \tag{3.12}$$

and the coefficient of variation for a population is given by

$$CV = \frac{\sigma}{\mu} \times 100\% \tag{3.13}$$


The coefficient of variation is a measure of relative variation, whereas the standard deviation is a 
measure of absolute variation. 


EXAMPLE 3.20 A national sampling of prices for new and used cars found that the mean price for a new car
is $20,100 and the standard deviation is $6,125 and that the mean price for a used car is $5,485 with a standard
deviation equal to $2,730. In terms of absolute variation, the standard deviation of price for new cars is more
than twice that of used cars. However, in terms of relative variation, there is more relative variation in the price
of used cars than in new cars. The CV for used cars is $\frac{2{,}730}{5{,}485} \times 100 = 49.8\%$ and the CV for new cars is
$\frac{6{,}125}{20{,}100} \times 100 = 30.5\%$.
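The comparison in Example 3.20 can be reproduced with a one-line function. A minimal sketch; the function name `coefficient_of_variation` is an assumption of this illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percent of the mean."""
    return sd / mean * 100

cv_new = coefficient_of_variation(6_125, 20_100)   # new cars
cv_used = coefficient_of_variation(2_730, 5_485)   # used cars

print(round(cv_new, 1), round(cv_used, 1))
```

Because the CV is unit-free, it can also compare variables measured in different units, which the standard deviation alone cannot do.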


Z SCORES 


A z score is the number of standard deviations that a given observation, x, is below or above the
mean. For sample data, the z score is

$$z = \frac{x - \bar{x}}{s} \tag{3.14}$$

and for population data, the z score is

$$z = \frac{x - \mu}{\sigma} \tag{3.15}$$


EXAMPLE 3.21 The mean salary for deputies in Douglas County is $27,500 and the standard deviation is 
$4,500. The mean salary for deputies in Hall County is $24,250 and the standard deviation is $2,750. A deputy 
who makes $30,000 in Douglas County makes $1,500 more than a deputy does in Hall County who makes 
$28,500. Which deputy has the higher salary relative to the county in which he works? 


For the deputy in Douglas County who makes $30,000, the z score is

$$z = \frac{x - \mu}{\sigma} = \frac{30,000 - 27,500}{4,500} = 0.56$$


For the deputy in Hall County who makes $28,500, the z score is

$$z = \frac{x - \mu}{\sigma} = \frac{28,500 - 24,250}{2,750} = 1.55$$


When the county of employment is taken into consideration, the $28,500 salary is a higher relative salary than 
the $30,000 salary. 
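
The comparison of the two deputies can be sketched in Python as follows. The function names are illustrative assumptions; the second function simply solves the z score equation for x.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Recover the observation that corresponds to a given z score."""
    return mean + z * sd

z_douglas = z_score(30_000, 27_500, 4_500)
z_hall = z_score(28_500, 24_250, 2_750)

print(f"Douglas County: z = {z_douglas:.2f}")
print(f"Hall County:    z = {z_hall:.2f}")
```

The larger z score identifies the salary that is higher relative to its own county, even though it is the smaller dollar amount.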


MEASURES OF POSITION: PERCENTILES, DECILES, AND QUARTILES 


Measures of position are used to describe the location of a particular observation in relation to 
the rest of the data set. Percentiles are values that divide the ranked data set into 100 equal parts. The 
pth percentile of a data set is a value such that at least p percent of the observations take on this value 
or less and at least (100 - p) percent of the observations take on this value or more. Deciles are 
values that divide the ranked data set into 10 equal parts. Quartiles are values that divide the ranked
data set into four equal parts. The techniques for finding the various measures of position will be 
illustrated by using the data in Table 3.4. Table 3.4 contains the aortic diameters measured in 
centimeters for 45 patients. Notice that the data in Table 3.4 are already ranked. Raw data need to be 
ranked prior to finding measures of position. 


Table 3.4 


The percentile for observation x is found by dividing the number of observations less than x by 
the total number of observations and then multiplying this quantity by 100. This percent is then 
rounded to the nearest whole number to give the percentile for observation x. 


EXAMPLE 3.22 The number of observations in Table 3.4 less than 5.5 is 11. Eleven divided by 45 is .244 and
.244 multiplied by 100 is 24.4%. This percent rounds to 24. The diameter 5.5 is the 24th percentile and we
express this as P24 = 5.5. The number of observations less than 5.0 is 9. Nine divided by 45 is .20 and .20
multiplied by 100 is 20%. P20 = 5.0. The number of observations less than 10.0 is 39. Thirty-nine divided by 45
is .867 and .867 multiplied by 100 is 86.7%. Since 86.7% rounds to 87%, we write P87 = 10.0.
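
The rule for the percentile of an observation can be sketched in Python. The function names and the small data list are hypothetical and are used only for illustration; the three counts checked at the end are the ones quoted in the example.

```python
import math

def percentile_of(count_below, n):
    """Percentile of an observation, given how many of the n observations lie below it."""
    pct = 100 * count_below / n
    return math.floor(pct + 0.5)  # round to the nearest whole number

def percentile_rank(data, x):
    """Percentile of observation x within a data set."""
    below = sum(1 for value in data if value < x)
    return percentile_of(below, len(data))

# Counts from the example: 11, 9, and 39 observations out of 45.
for count in (11, 9, 39):
    print(count, "of 45 below ->", percentile_of(count, 45))

# Hypothetical data set, not from the text.
sample = [3, 5, 5, 8, 9]
print(percentile_rank(sample, 8))
```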


The pth percentile for a ranked data set consisting of n observations is found by a two-step
procedure. The first step is to compute the index i = (p)(n)/100. If i is not an integer, the next integer
greater than i locates the position of the pth percentile in the ranked data set. If i is an integer, the pth
percentile is the average of the observations in positions i and i + 1 in the ranked data set.
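
The two-step procedure can be written as a short Python function. This is a sketch under stated assumptions: the function name is chosen here, p is taken to be a whole number strictly between 0 and 100, and the ten-value data set is hypothetical.

```python
def percentile(ranked, p):
    """pth percentile of already-ranked data, using the two-step index rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(ranked)
    # i = p*n/100, kept in integer arithmetic to avoid floating-point surprises
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: the next integer greater than i is whole + 1,
        # which is list index `whole` because Python counts from 0.
        return ranked[whole]
    # i is an integer: average the observations in positions i and i + 1.
    return (ranked[whole - 1] + ranked[whole]) / 2

# Hypothetical ranked data set with n = 10 observations.
ranked = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]
print(percentile(ranked, 25))  # i = 2.5, so position 3
print(percentile(ranked, 40))  # i = 4, so average of positions 4 and 5
```

Statistical packages often use slightly different interpolation rules, so their percentiles may not match this procedure exactly.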


EXAMPLE 3.23 To find the tenth percentile for the data of Table 3.4, compute i = (10)(45)/100 = 4.5. The next
integer greater than 4.5 is 5. The observation in the fifth position in Table 3.4 is 3.6. Therefore, P10 = 3.6. Note
that at least 10% of the data in Table 3.4 are 3.6 or less (the actual amount is 11.1%) and at least 90% of the
data are 3.6 or more (the actual amount is 91.1%). For very large data sets, the percentage of observations equal
to or less than P10 will be very close to 10% and the percentage of observations equal to or greater than P10 will
be very close to 90%.


EXAMPLE 3.24 To find the fortieth percentile for the data in Table 3.4, compute i = (40)(45)/100 = 18. The
fortieth percentile is the average of the observations in the 18th and 19th positions in the ranked data set. The
observation in the 18th position is 6.0 and the observation in the 19th position is 6.2. Therefore P40 = (6.0 + 6.2)/2 =
6.1. Note that 40% of the data in Table 3.4 are 6.1 or less and that 60% of the observations are 6.1 or more.

Deciles and quartiles are determined in the same manner as percentiles, since they may be
expressed as percentiles. The deciles are represented as D1, D2, ..., D9 and the quartiles are
represented by Q1, Q2, and Q3. The following equalities hold for deciles and percentiles:

D1 = P10, D2 = P20, D3 = P30, D4 = P40, D5 = P50, D6 = P60, D7 = P70, D8 = P80, D9 = P90

The following equalities hold for quartiles and percentiles:

Q1 = P25, Q2 = P50, Q3 = P75

From the above definitions of percentiles, deciles, and quartiles, the following equalities also hold:


Median = P50 = D5 = Q2


The techniques for finding percentiles, deciles, and quartiles differ somewhat from textbook to 
textbook, but the values obtained by the various techniques are usually very close to one another. 


INTERQUARTILE RANGE 
The interquartile range, designated by IQR, is defined as follows: 


$$IQR = Q_3 - Q_1 \qquad (3.16)$$


The interquartile range shows the spread of the middle 50% of the data and is not affected by 
extremes in the data set. 


EXAMPLE 3.25 The interquartile range for the aortic diameters in Table 3.4 is found by subtracting the value
of Q1 from Q3. The first quartile is equal to the 25th percentile and is found by noting that (45 × 25)/100 = 11.25
and therefore the position is 12. Q1 is in the 12th position in Table 3.4 and Q1 = 5.5. The third quartile is equal
to the 75th percentile and is found by noting that (45 × 75)/100 = 33.75 and therefore the position is 34. Q3 is in
the 34th position in Table 3.4 and Q3 = 8.5. The IQR equals 8.5 - 5.5 or 3.0 cm.


BOX-AND-WHISKER PLOT


A box-and-whisker plot, sometimes simply called a boxplot, is a graphical display in which a box
extending from Q1 to Q3 is constructed and which contains the middle 50% of the data. Lines, called
whiskers, are drawn from Q1 to the smallest value and from Q3 to the largest value. In addition, a
vertical line is constructed inside the box corresponding to the median.


EXAMPLE 3.26 A Minitab generated boxplot for the data in Table 3.4 is shown in Fig. 3-2. For these data,
the minimum diameter is 3.0 cm, the maximum diameter is 11.0 cm, Q1 = 5.5 cm, Q2 = 6.6 cm, and Q3 = 8.5
cm. Because Minitab uses a slightly different technique for finding the first and third quartile, the box extends
from 5.35 to 8.65, rather than from 5.5 to 8.5.


[Boxplot not reproduced; horizontal axis: Diameter (cm), 3 to 11]
Fig. 3-2


Another type of boxplot, called a modified boxplot, is also sometimes constructed in which possible
and probable outliers are identified. The modified boxplot is illustrated in problem 3.27.


Solved Problems


MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE 
FOR UNGROUPED DATA 


Table 3.5 


Table 3.5 gives the annual returns for 30 randomly selected mutual funds. Problems 3.1, 3.2, and 
3.3 refer to this data set. 


3.1 Find the mean for the annual returns in Table 3.5. 
Ans. For the data in Table 3.5, Σx = 455.20, n = 30, and x̄ = 15.17.


3.2 Find the median for the annual returns in Table 3.5.


Ans. The ranked annual returns are as follows:


-5.5 -2.5 3.5 4.0 5.5 7.5 10.5 10.5 10.5 10.5 12.0
12.5 12.5 12.7 14.0 14.0 14.5 14.5 14.5 17.0 17.5 19.0
20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0


The median is the average of the 15th and 16th values in the ranked returns or the average of 14.0
and 14.0, which equals 14.0.


3.3 Find the mode for the annual returns in Table 3.5.


Ans. By considering the ranked annual returns in the solution to problem 3.2, we see that the observation
10.5 occurs more frequently than any other value and is therefore the mode for this data set.


3.4 Table 3.6 gives the distribution of the cause of death due to accidents or violence for white
males during a recent year.


Table 3.6
Cause of death
Motor vehicle accident
All other accidents
Suicide
Homicide


What is the modal cause of death due to accidents or violence for white males? Can the mean
or median be calculated for the cause of death?


Ans. The modal cause of death due to accidents or violence is motor vehicle accident. Because this is
nominal level data, the mean and the median have no meaning.


3.5 Table 3.7 gives the selling prices in tens of thousands of dollars for 20 homes sold during the
past month. Find the mean, median, and mode. Which measure is most representative for the
selling price of such homes?


Table 3.7 


Ans. The ranked selling prices are: 
50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 
111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0


The mean is 154.1, the median is 105.7, and the mode is 100.0. The median is the most 
representative measure. The three selling prices 340.5, 475.5, and 525.0 inflate the mean and make 
it less representative than the median. Generally, the median is the best measure of central 
tendency to use when the data are skewed. 
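
The three measures for the selling prices can be checked with Python's standard statistics module. This is only a sketch; the variable names are chosen here. Carried to two decimal places, the mean is 154.15 and the median is 105.75, which the answer above reports to one decimal place.

```python
from statistics import mean, median, mode

# Ranked selling prices from Table 3.7 (tens of thousands of dollars).
prices = [50.0, 60.5, 70.0, 75.0, 79.0, 89.0, 90.0, 100.0, 100.0, 100.0,
          111.5, 113.5, 122.5, 125.5, 130.0, 150.0, 175.5, 340.5, 475.5, 525.0]

price_mean = mean(prices)
price_median = median(prices)
price_mode = mode(prices)

print(f"mean = {price_mean:.2f}, median = {price_median:.2f}, mode = {price_mode:.1f}")

# Dropping the three largest prices shows how strongly they pull the mean upward.
print(f"mean without the three largest prices = {mean(prices[:-3]):.2f}")
```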


MEASURES OF DISPERSION: RANGE, VARIANCE, AND STANDARD
DEVIATION FOR UNGROUPED DATA


3.6 Find the range, variance, and standard deviation for the annual returns of the mutual funds in
Table 3.5.


Ans. The range is 40.0 - (-5.5) = 45.5.

The variance is given by

$$s^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} = \frac{10,060 - \frac{(455.2)^2}{30}}{29} = 108.7275$$

and the standard deviation is s = √108.7275 = 10.43.


3.7 Find the range, variance, and standard deviation for the selling prices of homes in Table 3.7.


Ans. The range is 525 - 50 = 475.

The variance is given by

$$s^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} = \frac{812,884 - \frac{(3083)^2}{20}}{19} = 17,770.5026$$

and the standard deviation is s = √17,770.5026 = 133.31.


3.8 Compare the values of the standard deviations with range/4 for problems 3.6 and 3.7.


Ans. For problem 3.6, the values are 10.43 and 45.5/4 = 11.38 and for problem 3.7, the values are
133.31 and 475/4 = 118.75. Even though the distributions are skewed in both problems, the
values are reasonably close. The approximation is closer for mound-shaped distributions than for
other distributions.


3.9 What are the chief advantage and the chief disadvantage of the range as a measure of
dispersion?


Ans. The chief advantage is the simplicity of computation of the range and the chief disadvantage is that 
it is insensitive to the values between the extremes. 
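
The shortcut formula for the sample variance and the range/4 approximation of problems 3.6 to 3.8 can be sketched in Python. The function name is an assumption made for illustration. Because the code keeps Σx² unrounded, the variance for the mutual fund returns differs from the printed value in the third decimal place, while the standard deviation agrees to two decimals.

```python
import math

def sample_variance(data):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x**2 / n) / (n - 1)

# Ranked annual returns (Table 3.5) and selling prices (Table 3.7).
returns = [-5.5, -2.5, 3.5, 4.0, 5.5, 7.5, 10.5, 10.5, 10.5, 10.5, 12.0,
           12.5, 12.5, 12.7, 14.0, 14.0, 14.5, 14.5, 14.5, 17.0, 17.5, 19.0,
           20.2, 20.3, 22.0, 22.5, 27.5, 35.5, 38.0, 40.0]
prices = [50.0, 60.5, 70.0, 75.0, 79.0, 89.0, 90.0, 100.0, 100.0, 100.0,
          111.5, 113.5, 122.5, 125.5, 130.0, 150.0, 175.5, 340.5, 475.5, 525.0]

for name, data in (("returns", returns), ("prices", prices)):
    s = math.sqrt(sample_variance(data))
    rough = (max(data) - min(data)) / 4   # range/4 approximation to s
    print(f"{name}: s = {s:.2f}, range/4 = {rough:.2f}")
```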


3.10 The ages and incomes of the 10 employees at Computer Services Inc. are given in Table 3.8. 


Table 3.8 
Compute the standard deviation of ages and incomes for these employees. Assuming that all
employees remain with the company 5 years and that each income is multiplied by 1.5 over 
that period, what will the standard deviation of ages and incomes equal 5 years in the future? 


Ans. The current standard deviations are 10.21 years and $9,224.10. Five years from now, the standard 
deviations will equal 10.21 years and $13,836.15. In general, adding the same constant to each 
observation does not affect the standard deviation of the data set and multiplying each observation 
by the same constant multiplies the standard deviation by the constant. 
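
The general rule stated in this answer can be illustrated with a short Python sketch. The ages and incomes below are hypothetical values invented for the illustration; they are not the entries of Table 3.8.

```python
from statistics import stdev

# Hypothetical data, NOT the values from Table 3.8.
ages = [25, 31, 38, 42, 47, 55]
incomes = [28_000, 35_500, 41_000, 46_250, 52_000, 60_000]

ages_later = [age + 5 for age in ages]                 # add a constant
incomes_later = [1.5 * income for income in incomes]   # multiply by a constant

print(f"ages:    s = {stdev(ages):.2f} now, {stdev(ages_later):.2f} in 5 years")
print(f"incomes: s = {stdev(incomes):.2f} now, {stdev(incomes_later):.2f} in 5 years")
```

Adding 5 shifts every observation and the mean by the same amount, so the deviations from the mean are unchanged; multiplying by 1.5 stretches every deviation by 1.5.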


MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED 
DATA 


3.11 Table 3.9 gives the age distribution of individuals starting new companies. Find the mean, 
median, and mode for this distribution. 


Table 3.9
Age      Frequency
20-29    11
30-39    25
40-49    14
50-59    7
60-69    3


Ans. The mean is found by dividing Σxf by n, where Σxf = 24.5 × 11 + 34.5 × 25 + 44.5 × 14 + 54.5 ×
7 + 64.5 × 3 = 2,330 and n = 60. x̄ = 38.8. The median class is the class 30-39. In order to find
the middle of the age distribution, that is, the age where 30 are younger than this age and 30 are
older, we must proceed through the 11 individuals in the 20-29 age group and 19 in the 30-39 age
group. This gives 29.5 + (19/25) × 10 = 37.1 as the median age. The modal class is the class 30-39,
and the mode is the class mark for this class, which equals 34.5.


3.12 Find the range, variance, and standard deviation for the distribution in Table 3.9.


Ans. The range is 69.5 - 19.5 = 50.

The variance is given by

$$s^2 = \frac{\Sigma x^2 f - \frac{(\Sigma xf)^2}{n}}{n - 1} = \frac{97,355 - \frac{(2,330)^2}{60}}{59} = 116.497$$

and the standard deviation is s = √116.497 = 10.8.
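
The grouped-data computations of problems 3.11 and 3.12 can be sketched in Python. The variable names are chosen here; the class marks and frequencies are those used in the answer to problem 3.11.

```python
import math

# Class marks and frequencies for the age distribution of Table 3.9.
class_marks = [24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [11, 25, 14, 7, 3]

n = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))

grouped_mean = sum_xf / n
grouped_variance = (sum_x2f - sum_xf**2 / n) / (n - 1)
grouped_sd = math.sqrt(grouped_variance)

print(f"n = {n}, sum of xf = {sum_xf}, sum of x^2 f = {sum_x2f}")
print(f"mean = {grouped_mean:.1f}, variance = {grouped_variance:.3f}, s = {grouped_sd:.1f}")
```

Each class mark stands in for every observation in its class, which is why grouped results only approximate the values obtained from the raw data.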


3.13 The raw data corresponding to the grouped data in Table 3.9 is given in Table 3.10. Find the 
mean, median, and mode for the raw data and compare the results with the mean, median, and 
mode for the grouped data found in problem 3.11. 


Ans. The sum of the raw data equals 2,299, and the mean is 38.2. This compares with 38.8 for the
grouped data. The median is seen to be 37, and this compares with 37.1 for the grouped data. The
mode is seen to be 34 and this compares with 34.5 for the grouped data. This problem illustrates
that the measures of central tendency for a set of data in grouped and ungrouped form are
relatively close.


3.14 Find the range, variance, and standard deviation for the ungrouped data in Table 3.10 and
compare these with the same measures found for the grouped form in problem 3.12.


Ans. The range is 66 - 20 = 46. Σx = 2,299, Σx² = 94,685 and n = 60.

The variance is

$$s^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} = \frac{94,685 - \frac{(2,299)^2}{60}}{59} = 111.788$$

and s = 10.6.


These measures of dispersion compare favorably with the measures for the grouped data given in 
problem 3.12. 


CHEBYSHEV’S THEOREM AND THE EMPIRICAL RULE 


3.15 The mean lifetime of rats used in many psychological experiments equals 3.5 years, and the
standard deviation of lifetimes is 0.5 year. At least what percent will have lifetimes between
2.5 years and 4.5 years? At least what percent will have lifetimes between 2.0 years and 5.0
years?


Ans. The interval from 2.5 years to 4.5 years is a 2 standard deviation interval about the mean, i.e., k = 2
in Chebyshev's theorem. At least 75% of the rats will have lifetimes between 2.5 years and 4.5
years. The interval from 2.0 years to 5.0 years is a 3 standard deviation interval about the mean. At
least 89% of the lifetimes fall within this interval.


3.16 The mean height of adult females is 66 inches and the standard deviation is 2.5 inches. The
distribution of heights is mound-shaped. What percent have heights between: (a) 63.5 inches
and 68.5 inches? (b) 61.0 inches and 71.0 inches? (c) 58.5 inches and 73.5 inches?


Ans. 63.5 to 68.5 is a one standard deviation interval about the mean, 61.0 to 71.0 is a two standard 
deviation interval about the mean, and 58.5 to 73.5 is a three standard deviation interval about the 
mean. According to the empirical rule, the percentages are: (a) 68%; (b) 95% and (c) 99.7%. 


3.17 The mean length of service for Federal Bureau of Investigation (FBI) agents equals 9.5 years
and the standard deviation is 2.5 years. At least what percent of the employees have between
2.0 years of service and 17.0 years of service? If the lengths of service have a bell-shaped
distribution, what can you say about the percent having between 2.0 and 17.0 years of service?


Ans. The interval from 2.0 years to 17.0 years is a 3 standard deviation interval about the mean, i.e., k =
3 in Chebyshev's theorem. Therefore, at least 89% of the agents have lengths of service in this
interval. If we know that the distribution is bell shaped, then 99.7% of the agents will have lengths
of service between 2.0 and 17.0 years.


COEFFICIENT OF VARIATION 


3.18 Find the coefficient of variation for the ages in Table 3.10. 


Ans. From problem 3.13, the mean age is 38.2 years and from problem 3.14 the standard deviation is
10.6 years. Using formula (3.12), the coefficient of variation is CV = (s/x̄) × 100% = (10.6/38.2) × 100 =
27.7%.


3.19 The mean yearly salary of all the employees at Pretty Printing is $42,500 and the standard
deviation is $4,000. The mean number of years of education for the employees is 16 and the
standard deviation is 2.5 years. Which of the two variables has the higher relative variation?


Ans. The coefficient of variation for salaries is CV = (4,000/42,500) × 100 = 9.4% and the coefficient of
variation for years of education is CV = (2.5/16) × 100 = 15.6%. Years of education has a higher relative
variation.


Z SCORES 


3.20 The mean daily intake of protein for a group of individuals is 80 grams and the standard
deviation is 8 grams. Find the z scores for individuals with the following daily intakes of
protein: (a) 95 grams; (b) 75 grams; (c) 80 grams.


Ans. (a) z = (95 - 80)/8 = 1.88   (b) z = (75 - 80)/8 = -.63   (c) z = (80 - 80)/8 = 0


3.21 Three individuals were selected from the group described in problem 3.20 who have daily
intakes with z scores equal to -1.4, 0.5, and 3.0. Find their daily intakes of protein.


Ans. If the equation z = (x - x̄)/s is solved for x, the result is x = x̄ + zs. The daily intake corresponding to
a z score of -1.4 is x = 80 + (-1.4)(8) = 68.8 grams. For a z score equal to 0.5, x = 80 + (0.5)(8) =
84 grams. For a z score equal to 3.0, x = 80 + (3.0)(8) = 104 grams.


MEASURES OF POSITION: PERCENTILES, DECILES, AND QUARTILES 


3.22 Find the percentiles for the ages 34, 45, and 55 in Table 3.10. 


Ans. The number of ages less than 34 is 20, and (20/60) × 100 = 33.3%, which rounds to 33%. The age 34
is the thirty-third percentile.
The number of ages less than 45 is 45, and (45/60) × 100 = 75%. The age 45 is the seventy-fifth
percentile.
The number of ages less than 55 is 53, and (53/60) × 100 = 88.3%, which rounds to 88%. The age 55
is the eighty-eighth percentile.


3.23 Find the ninety-fifth percentile, the seventh decile, and the first quartile for the age distribution 
given in Table 3.10. 


Ans. To find P95, compute i = np/100 = (60)(95)/100 = 57. P95 is the average of the observations in positions
57 and 58 in the ranked data set, or the average of 58 and 62 which is 60 years.

To find D7, compute i = np/100 = (60)(70)/100 = 42. D7 is the average of the
observations in positions 42 and 43 in the ranked data set, or the average of 41 and 44 which is
42.5 years.

To find Q1, compute i = np/100 = (60)(25)/100 = 15. Q1 is the average of the observations in positions
15 and 16 in the ranked data set, or the average of 32 and 32 which is 32 years.


INTERQUARTILE RANGE 


3.24 Find the interquartile range for the annual returns of the mutual funds given in Table 3.5. 


Ans. The ranked annual returns are as follows. 


-5.5 -2.5 3.5 4.0 5.5 7.5 10.5 10.5 10.5 10.5 12.0
12.5 12.5 12.7 14.0 14.0 14.5 14.5 14.5 17.0 17.5 19.0
20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0


The first quartile, which is the same as P25, is found by computing i = np/100 = (30)(25)/100 = 7.5.
The next integer greater than 7.5 is 8, and this locates the position of Q1 in the ranked data set.
Q1 = 10.5.

The third quartile is found by computing i = np/100 = (30)(75)/100 = 22.5, and rounding this up to 23.
The third quartile is found in position 23 in the ranked data set. Q3 = 20.2.

IQR = Q3 - Q1 = 20.2 - 10.5 = 9.7.


3.25 Find the interquartile range for the selling prices given in Table 3.7. 


Ans. The ranked selling prices are: 


50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 
111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0


The first quartile is found by computing i = np/100 = (20)(25)/100 = 5. The
first quartile is the average of the observations in positions 5 and 6 in the ranked data set.

Q1 = (79.0 + 89.0)/2 = 84.0

The third quartile is found by computing i = np/100 = (20)(75)/100 = 15. The third quartile is the
average of the observations in positions 15 and 16 in the ranked data set.

Q3 = (130.0 + 150.0)/2 = 140.0     IQR = Q3 - Q1 = 140.0 - 84.0 = 56.0
BOX-AND-WHISKER PLOT 


3.26 Table 3.11 gives the number of days that 25 individuals spent in house arrest in a criminal 
justice study. Use Minitab to construct a boxplot for these data. 


Table 3.11 


Ans. The solution is shown in Fig. 3-3. It is seen that the minimum value is 25, the maximum value is
90, Q1 = 55, median = 60, and Q3 = 83.


[Boxplot not reproduced; horizontal axis: Days, 20 to 90]
Fig. 3-3
3.27 Construct a modified boxplot for the selling prices given in Table 3.7. 


Ans. The ranked selling prices are: 


50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 
111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0


In problem 3.25 it is shown that Q1 = 84.0, Q3 = 140.0, and IQR = 56.0.


A lower inner fence is defined to be Q1 - 1.5 × IQR = 84.0 - 1.5 × 56.0 = 0.0

An upper inner fence is defined to be Q3 + 1.5 × IQR = 140.0 + 1.5 × 56.0 = 224.0

A lower outer fence is defined to be Q1 - 3 × IQR = 84.0 - 3 × 56.0 = -84.0

An upper outer fence is defined to be Q3 + 3 × IQR = 140.0 + 3 × 56.0 = 308.0


The adjacent values are the most extreme values still lying within the inner fences. The 
adjacent values for the above data are 50.0 and 175.5, since they are the most extreme values 
between 0.0 and 224.0. In a modified boxplot, the whiskers extend only to the adjacent values. 




Data values that lie between the inner and outer fences are possible outliers and data values 
that lie outside the outer fences are probable outliers. The observations 340.5, 475.5, and 525.0 lie 
outside the outer fences and are called probable outliers. 
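
The fences, adjacent values, and outliers of this problem can be reproduced with a short Python sketch. The helper function and variable names are assumptions made for illustration; the quartiles are found with the index rule used in this chapter, which is why they match problem 3.25 rather than a statistical package's output.

```python
def percentile(ranked, p):
    """pth percentile of ranked data by the index rule i = p*n/100 (0 < p < 100)."""
    whole, remainder = divmod(p * len(ranked), 100)
    if remainder:                       # i not an integer: take the next position
        return ranked[whole]
    return (ranked[whole - 1] + ranked[whole]) / 2   # average positions i and i + 1

# Ranked selling prices from Table 3.7.
prices = [50.0, 60.5, 70.0, 75.0, 79.0, 89.0, 90.0, 100.0, 100.0, 100.0,
          111.5, 113.5, 122.5, 125.5, 130.0, 150.0, 175.5, 340.5, 475.5, 525.0]

q1, q3 = percentile(prices, 25), percentile(prices, 75)
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences

within_inner = [x for x in prices if inner[0] <= x <= inner[1]]
adjacent = (min(within_inner), max(within_inner))   # whisker endpoints
possible = [x for x in prices
            if outer[0] <= x <= outer[1] and not inner[0] <= x <= inner[1]]
probable = [x for x in prices if x < outer[0] or x > outer[1]]

print("inner fences:", inner, "outer fences:", outer)
print("adjacent values:", adjacent)
print("possible outliers:", possible, "probable outliers:", probable)
```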

A Minitab printout for a boxplot of the data is given in Fig. 3-4. Each probable outlier is 
represented by an asterisk. 


[Modified boxplot not reproduced; horizontal axis: Selling price, 0 to 500; probable outliers marked with asterisks]
Fig. 3-4


Supplementary Problems 


MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA 


3.28 Table 3.12 gives the verbal scores for 25 individuals on the Scholastic Aptitude Test (SAT). Find the 
mean verbal score for this set of data. 


Table 3.12 


Ans. Σx = 13,027   x̄ = 521.1
3.29 Find the median verbal score for the data in Table 3.12. 


Ans. The ranked scores are shown below. 


340 350 375 380 400 420 440 445 450 467 495 500 545
560 565 580 590 605 625 630 635 635 640 675 680 


The median verbal score is 545. 
3.30 Find the mode for the verbal scores in Table 3.12. 


Ans. 635 
3.31 Give the output produced by the Describe command of Minitab when the data in Table 3.12 are analyzed.
Ans. Descriptive Statistics


Variable    N     Mean   Median   TrMean   StDev   SEMean
SAT        25    521.1    545.0    522.0   108.1     21.6

Variable    Min     Max      Q1      Q3
SAT       340.0   680.0   430.0   627.5


3.32 Table 3.13 gives a stem-and-leaf display for the number of hours per week spent watching TV for a group
of teenagers. Find the mean, median, and mode for this distribution. What is the shape of the distribution?


Table 3.13 


O93. 5 


000000000000555 


Ans. The mean, median, and mode are each equal to 20 hours. The distribution has a bell shape with
center at 20.


MEASURES OF DISPERSION: RANGE, VARIANCE, AND STANDARD DEVIATION 
FOR UNGROUPED DATA 


3.33 Find the range, variance, and standard deviation for the SAT scores in Table 3.12.

Ans. range = 340   s² = 11,692.91   s = 108.13

3.34 Find the range, variance, and standard deviation for the number of hours spent watching TV given in
Table 3.13.

Ans. range = 20   s² = 18.4213   s = 4.29

3.35 What should your response be if you find that the variance of a data set equals -5.5?

Ans. Check your calculations, since the variance can never be negative.

3.36 Consider a data set in which all observations are equal. Find the range, variance, and standard deviation
for this data set.

Ans. The range, variance, and standard deviation will all equal zero.

3.37 A data set consisting of 10 observations has a mean equal to 0, and a variance equal to a. Express Σx² in
terms of a.

Ans. Σx² = 9a


MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED DATA 


3.38 Table 3.14 gives the distribution of the words per minute for 60 individuals using a word processor. Find
the mean, median, and mode for this distribution.


Table 3.14


Words per minute Frequency


60


Ans. mean = 71.5, median = 71.5, modes = 64.5 and 74.5

3.39 Find the range, variance, and standard deviation for the distribution in Table 3.14.

Ans. range = 60   s² = 184.0678   s = 13.57


3.40 A quality control technician records the number of defective units found daily in samples of size 100 for
the month of July. The distribution of the number of defectives per 100 units is shown in Table 3.15.
Find the mean, median, and mode for this distribution. Which of the three measures is the most
representative for the distribution?


Table 3.15 


Number of defectives Frequency


Ease 


Ans. mean = 2   median = 0   mode = 0
The median or mode would be more representative than the mean. On most days, there were none
or one defective in the sample. The two days on which the process was out of control inflated the
mean.


3.41 The following formula is sometimes used for finding the median of grouped data.


$$\text{Median} = L + \frac{.5n - c}{f} \times (U - L)$$


In this formula, L is the lower boundary of the median class, U is the upper boundary of the median class,
n is the number of observations, f is the frequency of the median class, and c is the cumulative frequency
of the class preceding the median class. Give the values for L, U, n, f, c, and find the median for the
distribution given in Table 3.14.


Ans. L = 69.5   U = 79.5   n = 60   f = 15   c = 27   median = 71.5
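
The grouped-median formula of this problem can be sketched as a small Python function. The function and argument names are chosen here for illustration. The first call uses the values in the answer above; the second uses the values from solved problem 3.11 (median class 30-39, with 11 observations in the preceding class and 25 in the median class).

```python
def grouped_median(lower, upper, n, f, c):
    """Median of grouped data.

    lower, upper: boundaries of the median class
    n: number of observations
    f: frequency of the median class
    c: cumulative frequency of the class preceding the median class
    """
    return lower + (0.5 * n - c) / f * (upper - lower)

median_words = grouped_median(69.5, 79.5, 60, 15, 27)   # Table 3.14
median_age = grouped_median(29.5, 39.5, 60, 25, 11)     # Table 3.9

print(f"Table 3.14: median = {median_words:.1f}")
print(f"Table 3.9:  median = {median_age:.1f}")
```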


CHEBYSHEV’S THEOREM AND THE EMPIRICAL RULE 


3.42 A sociological study of gang members in a large midwestern city found the mean age of the gang
members in the study to be 14.5 years and the standard deviation to be 1.5 years. According to
Chebyshev's theorem, at least what percent will be between 10.0 and 19.0 years of age?


Ans. 89%


3.43 A psychological study of Alcoholics Anonymous members found the mean number of years without
drinking alcohol for individuals in the study to be 5.5 years and the standard deviation to be 1.5 years.
The distribution of the number of years without drinking is bell-shaped. What percent of the distribution
is between: (a) 4.0 and 7.0 years; (b) 2.5 and 8.5 years; (c) 1.0 and 10.0 years?


Ans. (a) 68% (b) 95% (c) 99.7%


3.44 The ranked verbal SAT scores in Table 3.12 are:


340 350 375 380 400 420 440 445 450 467 495 500 545
560 565 580 590 605 625 630 635 635 640 675 680

The mean and standard deviation of these data are 521.1 and 108.1, respectively. According to
Chebyshev's theorem, at least 75% of the observations are between 304.9 and 737.3. What is the actual
percent of observations within two standard deviations of the mean?


Ans. 100% 


COEFFICIENT OF VARIATION 


3.45 The verbal scores on the SAT given in Table 3.12 have a mean equal to 521.1 and a standard deviation 
equal to 108.1. Find the coefficient of variation for these scores. 


Ans. 20.7% 


3.46 Fastners Inc. produces nuts and bolts. One of their bolts has a mean length of 2.00 inches with a standard 
deviation equal to 0.10 inch, and another type bolt has a mean length of 0.25 inch. What standard 
deviation would the second type bolt need to have in order that both types of bolts have the same 
coefficient of variation? 


Ans. 0.0125 inch 


Z SCORES 


3.47 The low-density lipoprotein (LDL) cholesterol concentration for a group has a mean equal to 140 mg/dL 
and a standard deviation equal to 40 mg/dL. Find the z scores for individuals having LDL values of (a) 
115; (b) 140; and (c) 200. 
Ans. (a) -0.63 (b) 0.0 (c) 1.50


3.48 Three individuals from the group described in problem 3.47 have z scores equal to (a) -1.75; (b) 0.5; and 
(c) 2.0. Find their LDL values. 


Ans. (a) 70 (b) 160 (c) 220 


MEASURES OF POSITION: PERCENTILES, DECILES, AND QUARTILES 


3.49 Table 3.16 gives the ages of commercial aircraft randomly selected from several airlines. Find the 
percentiles for the ages 10, 15, and 20. 


Table 3.16 


Ans. The age 10 is the thirtieth percentile. The age 15 is the fifty-eighth percentile. The age 20 is the
eighty-fourth percentile.


3.50 Find P90, D8, and Q3 for the commercial aircraft ages in Table 3.16.

Ans. P90 = 21   D8 = 18   Q3 = 16


INTERQUARTILE RANGE 


3.51 The first quartile for the salaries of county sheriffs in the United States is $37,500 and the third quartile is
$50,500. What is the interquartile range for the salaries of county sheriffs?


Ans. $13,000 


3.52 Find the interquartile range for the ages of commercial aircraft given in Table 3.16. 


Ans. 9 years 


BOX-AND-WHISKER PLOT 


3.53 A Minitab produced boxplot for the weights of high school football players is shown in Fig. 3-5. Give the
minimum, maximum, first quartile, median, and third quartile.


[Boxplot not reproduced; horizontal axis: Weights, 190 to 270]
Fig. 3-5
Ans. minimum = 195   maximum = 270   Q1 = 205   Q2 = 240   Q3 = 260
3.54 Construct or use Minitab to construct a boxplot for the ages of the commercial aircraft in Table 3.16. 


Ans. The boxplot is shown in Fig. 3-6. 


Chapter 4 


Probability 


EXPERIMENT, OUTCOMES, AND SAMPLE SPACE 


An experiment is any operation or procedure whose outcomes cannot be predicted with certainty. 
The set of all possible outcomes for an experiment is called the sample space for the experiment.


EXAMPLE 4.1 Games of chance are examples of experiments. The single toss of a coin is an experiment 
whose outcomes cannot be predicted with certainty. The sample space consists of two outcomes, heads or tails. 
The letter S is used to represent the sample space and may be represented as S = {H, T}. The single toss of a die 
is an experiment resulting in one of six outcomes. S may be represented as {1, 2, 3, 4, 5, 6}. When a card is
selected from a standard deck, 52 outcomes are possible. When a roulette wheel is spun, the outcome cannot be
predicted with certainty. 


EXAMPLE 4.2 When a quality control technician selects an item for inspection from a production line, it may 
be classified as defective or nondefective. The sample space may be represented by S = {D, N}. When the blood
type of a patient is determined, the sample space may be represented as S = {A, AB, B, O}. When the
Myers-Briggs personality type indicator is administered to an individual, the sample space consists of 16 possible
outcomes. 


The experiments discussed in Examples 4.1 and 4.2 are rather simple experiments and the
descriptions of the sample spaces are straightforward. More complicated experiments are discussed 
in the following section and techniques such as tree diagrams are utilized to describe the sample 
space for these experiments. 


TREE DIAGRAMS AND THE COUNTING RULE 


In a tree diagram, each outcome of an experiment is represented as a branch of a geometric 
figure called a tree. 


EXAMPLE 4.3 Figure 4-1 shows a tree diagram for the experiment of tossing a coin twice. The tree has four 
branches. Each branch is an outcome for the experiment. If the experiment is expanded to three tosses, the
branches are simply continued with H or T added to the end of each branch shown in Fig. 4-1. This would result
in the eight outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT. This technique could be continued
systematically to give the outcomes for n tosses of a coin. Notice that 2 tosses has 4 outcomes and 3 tosses has 8
outcomes. In general, n tosses has 2^n possible outcomes.


The counting rule for a two-step experiment states that if the first step can result in any one of n1
outcomes, and the second step in any one of n2 outcomes, then the experiment can result in (n1)(n2)
outcomes. If a third step is added with n3 outcomes, then the experiment can result in (n1)(n2)(n3)
outcomes. The counting rule applies to an experiment consisting of any number of steps. If the
counting rule is applied to Example 4.3, we see that for two tosses of a coin, n1 = 2, n2 = 2, and the
number of outcomes for the experiment is 2 × 2 = 4. For three tosses, there are 2 × 2 × 2 = 8
outcomes and so forth. The counting rule may be used to figure the number of outcomes of an
experiment and then a tree diagram may be used to actually represent the outcomes.
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first toss second toss outcome 
HH 


HT 


Fig. 4-1 
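The counting rule and the tree diagram can be cross-checked by direct enumeration. The following Python sketch is only an illustration; the function name list_outcomes is an assumption made for this example and does not come from the outline.

```python
from itertools import product

def list_outcomes(*steps):
    """Return every outcome of a multi-step experiment as a string.

    Each argument lists the possible results of one step, so the number
    of outcomes returned is the product of the step sizes (counting rule).
    """
    return ["".join(branch) for branch in product(*steps)]

coin = ["H", "T"]
two_tosses = list_outcomes(coin, coin)
three_tosses = list_outcomes(coin, coin, coin)

print(two_tosses)         # ['HH', 'HT', 'TH', 'TT']
print(len(three_tosses))  # 8
```

Each string corresponds to one branch of the tree, and the length of the list agrees with the count given by the counting rule.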


EXAMPLE 4.4 For the experiment of rolling a pair of dice, the first die may be any of six numbers and the 
second die may be any one of six numbers. According to the counting rule, there are 6 x 6 = 36 outcomes. The 
outcomes may be represented by a tree having 36 branches. The sample space may also be represented by a two- 
dimensional plot as shown in Fig. 4-2. 


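Fig. 4-2 Two-dimensional plot of the sample space for rolling a pair of dice (horizontal axis: Die 1, values 1 to 6; vertical axis: Die 2, values 1 to 6)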


EXAMPLE 4.5 An experiment consists of observing the blood types for five randomly selected individuals. 
Each of the five will have one of four blood types A, B, AB, or O. Using the counting rule, we see that the
experiment has 4 x 4 x 4 x 4 x 4 = 1,024 possible outcomes. In this case constructing a tree diagram would be 
difficult. 


EVENTS, SIMPLE EVENTS, AND COMPOUND EVENTS 


An event is a subset of the sample space consisting of at least one outcome from the sample 
space. If the event consists of exactly one outcome, it is called a simple event. If an event consists of
more than one outcome, it is called a compound event. 


EXAMPLE 4.6 A quality control technician selects two computer mother boards and classifies each as 
defective or nondefective. The sample space may be represented as S = {NN, ND, DN, DD}, where D 
represents a defective unit and N represents a nondefective unit. Let A represent the event that neither unit is 
defective and let B represent the event that at least one of the units is defective. A = {NN} is a simple event and
B = {ND, DN, DD} is a compound event. Figure 4-3 is a Venn diagram representation of the sample space S
and the events A and B. In a Venn diagram, the sample space is usually represented by a rectangle and events
are represented by circles within the rectangle.


Fig. 4-3 


EXAMPLE 4.7 For the experiment described in Example 4.5, there are 1,024 different outcomes for the blood 
types of the five individuals. The compound event that all five have the same blood type is composed of the 
following four outcomes: (A, A, A, A, A), (B, B, B, B, B), (AB, AB, AB, AB, AB), and (O, O, O, O, O). The 
simple event that all five have blood type O would be the outcome (O, O, O, O, O).


PROBABILITY 


Probability is a measure of the likelihood of the occurrence of some event. There are several 
different definitions of probability. Three definitions are discussed in the next section. The particular 
definition that is utilized depends upon the nature of the event under consideration. However, all the 
definitions satisfy the following two specific properties and obey the rules of probability developed 
later in this chapter. 

The probability of any event E is represented by the symbol P(E) and the symbol is read as “P of 
E” or as “the probability of event E.” P(E) is a real number between zero and one as indicated in the 
following inequality: 


0 ≤ P(E) ≤ 1 (4.1)

The sum of the probabilities for all the simple events of an experiment must equal one. That is, if 
E1, E2, ..., En are the simple events for an experiment, then the following equality must be true:

P(E1) + P(E2) + ... + P(En) = 1 (4.2)


Equality (4.2) is also sometimes expressed as in formula (4.3): 
P(S) = 1 (4.3) 


Equation (4.3) states that the probability that some outcome in the sample space will occur is one. 


CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY 
DEFINITIONS 


The classical definition of probability is appropriate when all outcomes of an experiment are
equally likely. For an experiment consisting of n outcomes, the classical definition of probability
assigns probability 1/n to each outcome or simple event. For an event E consisting of k outcomes, the
probability of event E is given by formula (4.4):

P(E) = k/n (4.4)


EXAMPLE 4.8 The experiment of selecting one card randomly from a standard deck of cards has 52 equally
likely outcomes. The event A1 = {club} has probability 13/52, since A1 consists of 13 outcomes. The event A2 =
{red card} has probability 26/52, since A2 consists of 26 outcomes. The event A3 = {face card (Jack, Queen,
King)} has probability 12/52, since A3 consists of 12 outcomes.
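As an illustration of formula (4.4), the following Python sketch counts favorable outcomes in a deck of 52 cards. The names DECK and classical_probability are assumptions introduced for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely outcomes

def classical_probability(event, sample_space):
    """P(E) = k/n, where k outcomes out of n equally likely ones are in E."""
    k = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(k, len(sample_space))

p_club = classical_probability(lambda card: card[1] == "clubs", DECK)
p_red = classical_probability(lambda card: card[1] in ("diamonds", "hearts"), DECK)
p_face = classical_probability(lambda card: card[0] in ("J", "Q", "K"), DECK)

print(p_club, p_red, p_face)  # 1/4 1/2 3/13
```

Fraction reduces 13/52, 26/52, and 12/52 to lowest terms, so the printed values are the same probabilities as in Example 4.8.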


EXAMPLE 4.9 Table 4.1 gives information concerning fifty organ transplants in the state of Nebraska during
a recent year. Each patient represented in Table 4.1 had only one transplant. If one of the 50 patient records is
randomly selected, the probability that the patient had a heart transplant is 15/50 = .30, since 15 of the patients had
heart transplants. The probability that a randomly selected patient had to wait one year or more for the transplant
is 20/50 = .40, since 20 of the patients had to wait one year or more. The display in Table 4.1 is called a two-way
table. It displays two different variables concerning the patients.


Table 4.1

Waiting Time for Transplant
Type of transplant

Heart 10 5
Kidney 7 3
Liver 5
Pancreas 3
Eyes 5


EXAMPLE 4.10 To find the probability of the event A that the sum of the numbers on the faces of a pair of
dice equals seven when a pair of dice is rolled, consider the sample space shown in Fig. 4-4. The event A is
shown as a rectangular box in the sample space. The outcomes in A are as follows: A = {(1, 6), (2, 5), (3, 4),
(4, 3), (5, 2), (6, 1)}. Since A contains six of the thirty-six equally likely outcomes for the experiment, the
probability of event A is 6/36 = 1/6.


Fig. 4-4 Sample space for rolling a pair of dice, with event A boxed (axes: Die 1, Die 2)


The classical definition of probability is not always appropriate in computing probabilities of
events. If a coin is bent, heads and tails are not equally likely outcomes. If a die has been loaded,
the six faces do not each have probability of occurrence equal to 1/6. For experiments not having
equally likely outcomes, the relative frequency definition of probability is appropriate. The relative
frequency definition of probability states that if an experiment is performed n times, and if event E
occurs f times, then the probability of event E is given by formula (4.5).


P(E) = f/n (4.5)


EXAMPLE 4.11 A bent coin is tossed 50 times and a head appears on 35 of the tosses. The relative frequency
definition of probability assigns the probability 35/50 = .70 to the event that a head occurs when this coin is tossed.
A loaded die is tossed 75 times and the face “6” appears 15 times in the 75 tosses. The relative frequency
definition of probability assigns the probability 15/75 = .20 to the event that the face “6” will appear when this die
is tossed.


EXAMPLE 4.12 A study by the state of Tennessee found that when 750 drivers were randomly stopped, 471
were found to be wearing seat belts. The relative frequency probability that a driver wears a seat belt in
Tennessee is 471/750 = 0.63.
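Formula (4.5) is a single division, as the following Python sketch suggests. The function name relative_frequency and its input checks are assumptions added for illustration.

```python
def relative_frequency(f, n):
    """Estimate P(E) as f/n: event E occurred f times in n trials."""
    if n <= 0:
        raise ValueError("the experiment must be performed at least once")
    if not 0 <= f <= n:
        raise ValueError("f must lie between 0 and n")
    return f / n

print(round(relative_frequency(35, 50), 2))    # bent coin: 0.7
print(round(relative_frequency(15, 75), 2))    # loaded die: 0.2
print(round(relative_frequency(471, 750), 2))  # seat belts: 0.63
```

Because 0 ≤ f ≤ n, the estimate always satisfies inequality (4.1).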


There are many circumstances where neither the classical definition nor the relative frequency 
definition of probability is applicable. The subjective definition of probability utilizes intuition, 
experience, and collective wisdom to assign a degree of belief that an event will occur. This method 
of assigning probabilities allows for several different assignments of probability to a given event. 
The different assignments must satisfy formulas (4.1) and (4.2).


EXAMPLE 4.13 A military planner states that the probability of nuclear war in the next year is 1%. The 
individual is assigning a subjective probability of .01 to the probability of the event “nuclear war in the next 
year.” This event does not lend itself to either the classical definition or the relative frequency definition of 
probability. 


EXAMPLE 4.14 A medical doctor tells a patient with a newly diagnosed cancer that the probability of 
successfully treating the cancer is 90%. The doctor is assigning a subjective probability of .90 to the event that 
the cancer can be successfully treated. The probability for this event cannot be determined by either the classical
definition or the relative frequency definition of probability. 


MARGINAL AND CONDITIONAL PROBABILITIES 


Table 4.2 classifies the 500 members of a police department according to their minority status as 
well as their promotional status during the past year. One hundred of the individuals were classified 
as being a minority and seventy were promoted during the past year. The probability that a randomly
selected individual from the police department is a minority is 100/500 = .20 and the probability that a
randomly selected person was promoted during the past year is 70/500 = .14. Table 4.3 is obtained by
dividing each entry in Table 4.2 by 500.


Table 4.2

                  Minority
Promoted       No     Yes    Total
No            350      80      430
Yes            50      20       70
Total         400     100      500


The four probabilities in the center of Table 4.3, .70, .16, .10, and .04, are called joint
probabilities. The four probabilities in the margin of the table, .80, .20, .86, and .14, are called
marginal probabilities.


Table 4.3

                  Minority
Promoted       No     Yes    Total
No            .70     .16      .86
Yes           .10     .04      .14
Total         .80     .20     1.00


The joint probabilities concerning the selected police officer may be described as follows: 


.70 = the probability that the selected officer is not a minority and was not promoted 
.16 = the probability that the selected officer is a minority and was not promoted

.10 = the probability that the selected officer is not a minority and was promoted
.04 = the probability that the selected officer is a minority and was promoted 


The marginal probabilities concerning the selected police officer may be described as follows: 


.80 = the probability that the selected officer is not a minority 

.20 = the probability that the selected officer is a minority 

.86 = the probability that the selected officer was not promoted during the last year 
.14 = the probability that the selected officer was promoted during the last year 
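The joint and marginal probabilities listed above can be recomputed from cell counts. In the Python sketch below, the four counts are the joint probabilities of Table 4.3 multiplied by 500; the dictionary layout and the helper name marginal are assumptions made for illustration.

```python
# Cell counts for the 500 officers: (promotion status, minority status).
counts = {
    ("not promoted", "not minority"): 350,
    ("not promoted", "minority"): 80,
    ("promoted", "not minority"): 50,
    ("promoted", "minority"): 20,
}
total = sum(counts.values())  # 500

# Joint probabilities: each cell count divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

def marginal(label, position):
    """Sum the joint probabilities of every cell carrying the given label."""
    return sum(p for cell, p in joint.items() if cell[position] == label)

print(joint[("promoted", "minority")])    # 0.04
print(round(marginal("minority", 1), 2))  # 0.2
print(round(marginal("promoted", 0), 2))  # 0.14
```

A marginal probability is the sum of the joint probabilities in one row or one column of the table.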


In addition to the joint and marginal probabilities discussed above, another important concept is
that of a conditional probability. If it is known that the selected police officer is a minority, then the
conditional probability of promotion during the past year is 20/100 = .20, since 100 of the police officers
in Table 4.2 were classified as minority and 20 of those were promoted. This same probability may
be obtained from Table 4.3 by using the ratio .04/.20 = .20.

The conditional probability of the occurrence of event A, given that event B is known to have
occurred for some experiment, is represented by P(A | B) and is the ratio of the joint probability of
A and B to the probability of B. The following formula is used to compute a conditional probability.


P(A | B) = P(A and B) / P(B) (4.6)


The following example summarizes the above discussion and the newly introduced notation. 


EXAMPLE 4.15 For the experiment of selecting one police officer at random from those described in Table
4.2, define event A to be the event that the individual was promoted last year and define event B to be the event
that the individual is a minority. The joint probability of A and B is expressed as P(A and B) = .04. The
marginal probabilities of A and B are expressed as P(A) = .14 and P(B) = .20. The conditional probability of A
given B is P(A | B) = P(A and B)/P(B) = .04/.20 = .20.
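Formula (4.6) can be checked two ways: from the probabilities and from the underlying counts. The following Python sketch is illustrative; the function name conditional_probability is an assumption.

```python
def conditional_probability(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

# From the probabilities of Example 4.15: P(A and B) = .04, P(B) = .20.
from_probabilities = conditional_probability(0.04, 0.20)

# From the counts: 20 of the 100 minority officers were promoted.
from_counts = 20 / 100

print(round(from_probabilities, 2), from_counts)  # 0.2 0.2
```

Both routes give .20 because dividing the numerator and the denominator by the same total of 500 leaves the ratio unchanged.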


MUTUALLY EXCLUSIVE EVENTS 


Two or more events are said to be mutually exclusive if the events do not have any outcomes in 
common. They are events that cannot occur together. If A and B are mutually exclusive events then 
the joint probability of A and B equals zero, that is, P(A and B) = 0. A Venn diagram representation 
of two mutually exclusive events is shown in Fig. 4-5. 


Fig. 4-5 Venn diagram of two mutually exclusive events A and B


EXAMPLE 4.16 An experiment consists in observing the gender of two randomly selected individuals. The 
event, A, that both individuals are male and the event, B, that both individuals are female are mutually exclusive 
since if both are male, then both cannot be female and P(A and B) = 0.


EXAMPLE 4.17 Let event A be the event that an employee at a large company is a white collar worker and let 
B be the event that an employee is a blue collar worker. Then A and B are mutually exclusive since an employee 
cannot be both a blue collar worker and a white collar worker and P(A and B) = 0. 


DEPENDENT AND INDEPENDENT EVENTS 


If the knowledge that some event B has occurred influences the probability of the occurrence of 
another event A, then A and B are said to be dependent events. If knowing that event B has occurred 
does not affect the probability of the occurrence of event A, then A and B are said to be independent 
events. Two events are independent if the following equation is satisfied. Otherwise the events are 
dependent. 

P(A | B) = P(A) (4.7) 


The event of having a criminal record and the event of not having a father in the home are 
dependent events. The events of being a diabetic and having a family history of diabetes are 
dependent events, since diabetes is an inheritable disease. The events of having 10 letters in your last 
name and being a sociology major are independent events. However, many times it is not obvious 
whether two events are dependent or independent. In such cases, formula (4.7) is used to determine
whether the events are independent or not.


EXAMPLE 4.18 For the experiment of drawing one card from a standard deck of 52 cards, let A be the event
that a club is selected, let B be the event that a face card (jack, queen, or king) is drawn, and let C be the event
that a jack is drawn. Then A and B are independent events since P(A) = 13/52 = .25 and P(A | B) = 3/12 = .25;
P(A | B) = 3/12 because there are 12 face cards and 3 of them are clubs. The events B and C are dependent
events since P(C) = 4/52 = .077 and P(C | B) = 4/12 = .333; P(C | B) = 4/12 because there are 12 face cards and
4 of them are jacks.
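The test in formula (4.7) can be carried out by counting cards. The Python sketch below is one possible illustration; the helper names prob and independent are assumptions, and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))

def is_club(card):
    return card[1] == "clubs"

def is_face(card):
    return card[0] in ("J", "Q", "K")

def is_jack(card):
    return card[0] == "J"

def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning event is supplied."""
    space = DECK if given is None else [c for c in DECK if given(c)]
    return Fraction(sum(1 for c in space if event(c)), len(space))

def independent(a, b):
    """Formula (4.7): A and B are independent when P(A | B) = P(A)."""
    return prob(a, given=b) == prob(a)

print(independent(is_club, is_face))  # True
print(independent(is_jack, is_face))  # False
```

Conditioning on B restricts the sample space to the 12 face cards, which is the counting argument used in Example 4.18.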


EXAMPLE 4.19 Suppose one patient record is selected from the 125 represented in Table 4.4. The event that
a patient has a history of heart disease, A, and the event that a patient is a smoker, B, are dependent events, since
P(A) = 15/125 = .12 and P(A | B) = .22. For this group of patients, knowing that an individual is a smoker
almost doubles the probability that the individual has a history of heart disease.


Table 4.4 


Total 


COMPLEMENTARY EVENTS 


To every event A, there corresponds another event A^c, called the complement of A and consisting
of all other outcomes in the sample space not in event A. The word not is used to describe the
complement of an event. The complement of selecting a red card is not selecting a red card. The
complement of being a smoker is not being a smoker. Since an event and its complement must
account for all the outcomes of an experiment, their probabilities must add up to one. If A and A^c are
complementary events then the following equation must be true.


P(A) + P(A^c) = 1 (4.8)


EXAMPLE 4.20 Approximately 2% of the American population is diabetic. The probability that a randomly
chosen American is not diabetic is .98, since P(A) = .02, where A is the event of being diabetic, and .02 + P(A^c)
= 1. Solving for P(A^c) we get P(A^c) = 1 - .02 = .98.


EXAMPLE 4.21 Find the probability that on a given roll of a pair of dice “snake eyes” are not rolled.
Snake eyes means that a one was observed on each of the dice. Let A be the event of rolling snake eyes. Then
P(A) = 1/36 = .028. The event that snake eyes are not rolled is A^c. Then using formula (4.8), .028 + P(A^c) = 1, and
solving for P(A^c), it follows that P(A^c) = 1 - .028 = .972.
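The complement rule of formula (4.8) can be confirmed by listing the 36 outcomes for a pair of dice. This Python sketch is illustrative only, and the variable names are assumptions.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for a pair of dice.
rolls = list(product(range(1, 7), repeat=2))

# Event A: snake eyes, a one on each die.
p_snake_eyes = Fraction(sum(1 for r in rolls if r == (1, 1)), len(rolls))

# Complement rule: the probability of "not A" is 1 - P(A).
p_not_snake_eyes = 1 - p_snake_eyes

print(p_snake_eyes, p_not_snake_eyes)     # 1/36 35/36
print(round(float(p_not_snake_eyes), 3))  # 0.972
```

Counting the 35 outcomes that are not snake eyes directly gives the same value, 35/36.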


Complementary events are always mutually exclusive events but mutually exclusive events are 
not always complementary events. The events of drawing a club and drawing a diamond from a 
standard deck of cards are mutually exclusive, but they are not complementary events. 


MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS 


The intersection of two events A and B consists of all those outcomes which are common to both 
A and B. The intersection of the two events is represented as A and B. The intersection of two events
is also represented by the symbol A ∩ B, read as A intersect B. A Venn diagram representation of
the intersection of two events is shown in Fig. 4-6.


Fig. 4-6 


The probability of the intersection of two events is given by the multiplication rule. The 
multiplication rule is obtained from the formula for conditional probabilities, and is given by formula
(4.9).

P(A and B) = P(A) P(B | A) (4.9) 


EXAMPLE 4.22 A small hospital has 40 physicians on staff of which 5 are cardiologists. The probability that
two randomly selected physicians are both cardiologists is determined as follows. Let A be the event that the
first selected physician is a cardiologist, and B the event that the second selected physician is a cardiologist.
Then P(A) = 5/40 = .125, P(B | A) = 4/39 = .103 and P(A and B) = .125 x .103 = .013. If two physicians were
selected from a group of 40,000 of which 5,000 were cardiologists, then P(A) = 5,000/40,000 = .125,
P(B | A) = 4,999/39,999 = .125, and P(A and B) = (.125)^2 = .016. Notice that when the selection is from a large group, the probability of
selecting a cardiologist on the second selection is approximately the same as selecting one on the first selection.

Following this line of reasoning, suppose it is known that 12.5% of all physicians are cardiologists. If three
physicians are selected randomly, the probability that all three are cardiologists equals (.125)^3 = .002. The
probability that none of the three are cardiologists is (.875)^3 = .670.
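The two selections in Example 4.22 can be compared numerically. The Python sketch below applies formula (4.9) to selection without replacement; the function name both_selected is an assumption made for illustration.

```python
from fractions import Fraction

def both_selected(group_size, special):
    """P(A and B) = P(A) P(B | A) for two draws without replacement."""
    p_first = Fraction(special, group_size)
    p_second_given_first = Fraction(special - 1, group_size - 1)
    return p_first * p_second_given_first

small_staff = both_selected(40, 5)          # 5/40 x 4/39
large_group = both_selected(40000, 5000)    # 5,000/40,000 x 4,999/39,999

print(round(float(small_staff), 3))  # 0.013
print(round(float(large_group), 3))  # 0.016
```

For the large group the result is very close to (.125)^2, which is why the two selections may be treated as approximately independent.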


If events A and B are independent events, then P(A|B) = P(A) and P(B|A) = P(B). When P(B|A) 
is replaced by P(B), formula (4.9) simplifies to 


P(A and B) = P(A) P(B) (4.10) 


EXAMPLE 4.23 Ten percent of a particular population have hypertension and 40 percent of the same 
population have a home computer. Assuming that having hypertension and owning a home computer are 
independent events, the probability that an individual from this population has hypertension and owns a home 
computer is .10 x .40 = .04. Another way of stating this result is that 4 percent have hypertension and own a 
home computer. 


ADDITION RULE FOR THE UNION OF EVENTS 


The union of two events A and B consists of all those outcomes that belong to A or B or both A
and B. The union of events A and B is represented as A ∪ B or simply as A or B. A Venn diagram
representation of the union of two events is shown in Fig. 4-7. The darker part of the shaded union of
the two events corresponds to overlap and corresponds to the outcomes in both A and B.


Fig. 4-7


To find the probability of the union, we add P(A) and P(B). Notice, however, that the darker part
indicates that P(A and B) gets added twice and must be subtracted out to obtain the correct
probability. The resultant equation is called the addition rule for probabilities and is given by
formula (4.11).

P(A or B) = P(A) + P(B) - P(A and B) (4.11)


If A and B are mutually exclusive events, then P(A and B) = 0 and formula (4.11) simplifies
to the following.

P(A or B) = P(A) + P(B)


EXAMPLE 4.24 Forty percent of the employees at Computec, Inc. have a college degree, 30 percent have 
been with Computec for at least three years, and 15 percent have both a college degree and have been with the 
company for at least three years. If A is the event that a randomly selected employee has a college degree and B
is the event that a randomly selected employee has been with the company at least three years, then A or B is the
event that an employee has a college degree or has been with the company at least three years. The probability 
of A or B is .40 + .30 — .15 = .55. Another way of stating the result is that 55 percent of the employees have a 
college degree or have been with Computec for at least three years. 


EXAMPLE 4.25 A hospital employs 25 medical-surgical nurses, 10 intensive care nurses, 15 emergency room 
nurses, and 50 floor care nurses. If a nurse is selected at random, the probability that the nurse is a medical- 
surgical nurse or an emergency room nurse is .25 + .15 = .40. Since the events of being a medical-surgical nurse 
and an emergency room nurse are mutually exclusive, the probability is simply the sum of probabilities of the 
two events. 
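Both forms of the addition rule fit in one small function. In this Python sketch the name p_union is an assumption; leaving out the third argument corresponds to mutually exclusive events, for which P(A and B) = 0.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """Addition rule (4.11): P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# Example 4.24: college degree or at least three years with the company.
print(round(p_union(0.40, 0.30, 0.15), 2))  # 0.55

# Example 4.25: mutually exclusive nurse categories, 25/100 and 15/100.
print(round(p_union(0.25, 0.15), 2))        # 0.4
```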


BAYES’ THEOREM 


A computer disk manufacturer has three locations that produce computer disks. The Omaha plant
produces 30% of the disks, of which 0.5% are defective. The Memphis plant produces 50% of the
disks, of which 0.75% are defective. The Kansas City plant produces the remaining 20%, of which
0.25% are defective. If a disk is purchased at a store and found to be defective, what is the
probability that it was manufactured by the Omaha plant? This type of problem can be solved using
Bayes’ theorem. To formalize our approach, let A1 be the event that the disk was manufactured by
the Omaha plant, let A2 be the event that the disk was manufactured by the Memphis plant, and let A3
be the event that it was manufactured by the Kansas City plant. Let B be the event that the disk is
defective. We are asked to find P(A1 | B). This probability is obtained by dividing P(A1 and B) by
P(B).

The event that a disk is defective occurs if the disk is manufactured by the Omaha plant and is
defective or if the disk is manufactured by the Memphis plant and is defective or if the disk is
manufactured by the Kansas City plant and is defective. This is expressed as follows:


B = (A1 and B) or (A2 and B) or (A3 and B) (4.12)


Because the three events which are connected by or’s in formula (4.12) are mutually exclusive,
P(B) may be expressed as


P(B) = P(A1 and B) + P(A2 and B) + P(A3 and B) (4.13)

By using the multiplication rule, formula (4.13) may be expressed as

P(B) = P(B | A1) P(A1) + P(B | A2) P(A2) + P(B | A3) P(A3) (4.14)


Using formula (4.14), P(B) = .005 x .3 + .0075 x .5 + .0025 x .2 = .00575. That is, 0.575% of
the disks manufactured by all three plants are defective. The probability P(A1 and B) equals
P(B | A1) P(A1) = .005 x .3 = .0015. The probability we are seeking is equal to .0015/.00575 = .261.
Summarizing, if a defective disk is found, the probability that it was manufactured by the Omaha
plant is .261.

In using Bayes’ theorem to find P(A1 | B) use the following steps:


Step 1: Compute P(A1 and B) by using the equation P(A1 and B) = P(B | A1) P(A1).
Step 2: Compute P(B) by using formula (4.14).
Step 3: Divide the result in step 1 by the result in step 2 to obtain P(A1 | B).


These same steps may be used to find P(A2 | B) and P(A3 | B).

Events like A1, A2, and A3 are called collectively exhaustive. They are mutually exclusive and
their union equals the sample space. Bayes’ theorem is applicable to any number of collectively
exhaustive events.


EXAMPLE 4.26 Using the three-step procedure given above, the probability that a defective disk was 
manufactured by the Memphis plant is found as follows. 


Step 1: P(A2 and B) = P(B | A2) P(A2) = .0075 x .5 = .00375.
Step 2: P(B) = .005 x .3 + .0075 x .5 + .0025 x .2 = .00575.
Step 3: P(A2 | B) = .00375/.00575 = .652.


The probability that a defective disk was manufactured by the Kansas City plant is found as follows.


Step 1: P(A3 and B) = P(B | A3) P(A3) = .0025 x .2 = .0005.
Step 2: P(B) = .005 x .3 + .0075 x .5 + .0025 x .2 = .00575.
Step 3: P(A3 | B) = .0005/.00575 = .087.
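The three-step procedure can be applied to all three plants at once. The following Python sketch is a minimal illustration of those steps; the function name posteriors and the dictionary layout are assumptions, not part of the outline.

```python
def posteriors(priors, likelihoods):
    """Apply the three steps of Bayes' theorem to every event Ai.

    priors[k] is P(Ak) and likelihoods[k] is P(B | Ak).
    Returns the dictionary of P(Ak | B) together with P(B).
    """
    # Step 1: P(Ak and B) = P(B | Ak) P(Ak)
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    # Step 2: P(B) is the sum of the joint probabilities, formula (4.14)
    p_b = sum(joint.values())
    # Step 3: P(Ak | B) = P(Ak and B) / P(B)
    return {k: joint[k] / p_b for k in joint}, p_b

plant_share = {"Omaha": 0.30, "Memphis": 0.50, "Kansas City": 0.20}
defect_rate = {"Omaha": 0.005, "Memphis": 0.0075, "Kansas City": 0.0025}

posterior, p_defective = posteriors(plant_share, defect_rate)
print(round(p_defective, 5))  # 0.00575
for plant, p in posterior.items():
    print(plant, round(p, 3))  # Omaha 0.261, Memphis 0.652, Kansas City 0.087
```

The three posterior probabilities add to one, as they must for collectively exhaustive events.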


PERMUTATIONS AND COMBINATIONS 


Many of the experiments in statistics involve the selection of a subset of items from a larger
group of items. The experiment of selecting two letters from the four letters a, b, c, and d is such an
experiment. The following pairs are possible: (a, b), (a, c), (a, d), (b, c), (b, d), and (c, d). We say that
when selecting two items from four distinct items, there are six possible combinations. The
number of combinations possible when selecting n from N items is represented by the symbol C_n^N and
is given by

C_n^N = N! / (n! (N - n)!) (4.15)

NCn, C(N, n), and the binomial coefficient (N over n) are three other notations that are used for the
number of combinations in addition to the symbol C_n^N.

The symbol n!, read as “n factorial,” is equal to n x (n - 1) x (n - 2) x ... x 1. For example, 3! =
3 x 2 x 1 = 6, and 4! = 4 x 3 x 2 x 1 = 24. The values for n! become very large even for small values
of n. The value of 10! is 3,628,800, for example.

In the context of selecting two letters from four, N! = 4! = 4 x 3 x 2 x 1 = 24, n! = 2! = 2 x 1 = 2
and (N - n)! = 2! = 2. The number of combinations possible when selecting two items from four is
given by C_2^4 = 4!/(2! 2!) = 24/(2 x 2) = 6, the same number obtained when we listed all possibilities above.
When the number of items is larger than four or five, it is difficult to enumerate all of the
possibilities.


EXAMPLE 4.27 The number of five card poker hands that can be dealt from a deck of 52 cards is given by

C_5^52 = 52!/(5! 47!) = (52 x 51 x 50 x 49 x 48 x 47!)/(120 x 47!) = (52 x 51 x 50 x 49 x 48)/120 = 2,598,960.

Notice that by expressing 52! as 52 x 51 x 50 x 49 x 48 x 47!, we are able to divide 47! out because it is a
common factor in both the numerator and the denominator.
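Formula (4.15) can be checked against both an explicit listing and Python's built-in math.comb (available in Python 3.8 and later). The function name n_combinations is an assumption introduced for this sketch.

```python
from itertools import combinations
from math import comb, factorial

def n_combinations(N, n):
    """Formula (4.15): N! / (n! (N - n)!)."""
    return factorial(N) // (factorial(n) * factorial(N - n))

# Listing the combinations of two letters chosen from four.
pairs = list(combinations(["a", "b", "c", "d"], 2))
print(pairs)
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]

print(n_combinations(4, 2), comb(4, 2))  # 6 6
print(n_combinations(52, 5))             # 2598960
```

Integer division is exact here because the denominator always divides N!.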


If the order of selection of items is important, then we are interested in the number of
permutations possible when selecting n items from N items. The number of permutations possible
when selecting n objects from N objects is represented by the symbol P_n^N, and given by

P_n^N = N! / (N - n)! (4.16)

NPn, P(N, n), and (N)n are other symbols used to represent the number of permutations.


EXAMPLE 4.28 The number of permutations possible when selecting two letters from the four letters a, b, c,
and d is P_2^4 = 4!/(4 - 2)! = 4!/2! = 24/2 = 12. In this case, the 12 permutations are easy to list. They are ab, ba, ac, ca,
ad, da, bc, cb, bd, db, cd, and dc. Whenever two or more items are selected from N, there are more permutations
than combinations, because each different ordering is a different permutation but not a different combination.
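Formula (4.16) can be checked the same way, using itertools.permutations for the listing and math.perm (Python 3.8 and later) for the count. The function name n_permutations is an assumption made for this sketch.

```python
from itertools import permutations
from math import factorial, perm

def n_permutations(N, n):
    """Formula (4.16): N! / (N - n)!."""
    return factorial(N) // factorial(N - n)

# Every ordered selection of two letters from a, b, c, and d.
ordered_pairs = ["".join(p) for p in permutations("abcd", 2)]
print(ordered_pairs)
# ['ab', 'ac', 'ad', 'ba', 'bc', 'bd', 'ca', 'cb', 'cd', 'da', 'db', 'dc']

print(len(ordered_pairs), n_permutations(4, 2), perm(4, 2))  # 12 12 12
```

Each of the six combinations appears twice in the listing, once for each ordering of its two letters.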


EXAMPLE 4.29 A president, vice president, and treasurer are to be selected from a group of 10 individuals. 
How many different choices are possible? In this case, the order of listing of the three individuals for the three 
offices is important because a slate of Jim, Joe, and Jane for president, vice president, and treasurer is different 
from Joe, Jim, and Jane for president, vice president, and treasurer, for example. The number of permutations is
P_3^10 = 10!/7! = 10 x 9 x 8 = 720. That is, there are 720 different ordered slates of three that could serve as
president, vice president, and treasurer.


USING PERMUTATIONS AND COMBINATIONS TO SOLVE
PROBABILITY PROBLEMS


EXAMPLE 4.30 For a lotto contest in which six numbers are selected from the numbers 01 through 45, the
number of combinations possible for the six numbers selected is C_6^45 = 45!/(6! 39!) = 8,145,060. The probability that
you select the correct six numbers in order to win this lotto is 1/8,145,060 = .000000123. The probability of
winning the lotto can be reduced by requiring that the six numbers be selected in the correct order. The number
of permutations possible when six numbers are selected from the numbers 01 through 45 is given by P_6^45 = 45!/39!
= 45 x 44 x 43 x 42 x 41 x 40 = 5,864,443,200. The probability of winning the lotto is 1/5,864,443,200 =
.000000000171, when the order of selection of the six numbers is important.


EXAMPLE 4.31 A Royal Flush is a five-card hand consisting of the ace, king, queen, jack, and ten of the same
suit. The probability of a Royal Flush is equal to 4/2,598,960 = .00000154, since from Example 4.27, there are
2,598,960 five-card hands possible and four of them are Royal Flushes.
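The counts and probabilities in Examples 4.30 and 4.31 can be reproduced with math.comb and math.perm (Python 3.8 and later). This sketch is illustrative; the variable names are assumptions.

```python
from math import comb, perm

lotto_combinations = comb(45, 6)  # order of the six numbers ignored
lotto_permutations = perm(45, 6)  # order of the six numbers matters
poker_hands = comb(52, 5)

p_lotto = 1 / lotto_combinations
p_lotto_ordered = 1 / lotto_permutations
p_royal_flush = 4 / poker_hands   # one Royal Flush per suit

print(lotto_combinations)       # 8145060
print(lotto_permutations)       # 5864443200
print(f"{p_lotto:.3e}")         # 1.228e-07
print(f"{p_royal_flush:.3e}")   # 1.539e-06
```

Requiring the correct order multiplies the number of equally likely outcomes by 6! = 720, so the probability of winning is divided by 720.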


Solved Problems 


EXPERIMENT, OUTCOMES, AND SAMPLE SPACE 


4.1 An experiment consists of flipping a coin, followed by tossing a die. Give the sample space for 
this experiment. 


Ans. One of many possible representations of the sample space is S = {H1, H2, H3, H4, H5, H6, T1,
T2, T3, T4, T5, T6}.


4.2 Give the sample space for observing a patient's Rh blood type. 


Ans. One of many possible representations of the sample space is S = {Rh+, Rh-}.


TREE DIAGRAMS AND THE COUNTING RULE 


4.3 Use a tree diagram to illustrate the sample space for the experiment of observing the sex of the 
children in families consisting of three children. 


Ans. The tree diagram representation for the sex distribution of the three children is shown in Fig. 4-8,
where, for example, the branch or outcome mfm represents the outcome that the first born was a
male, the second born was a female, and the last born was a male.


Fig. 4-8


4.4 A sociological study consists of recording the marital status, religion, race, and income of an
individual. If marital status is classified into one of four categories, religion into one of three 
categories, race into one of five categories, and income into one of five categories, how many 
outcomes are possible for the experiment of recording the information of one of these 
individuals? 


Ans. Using the counting rule, we see that there are 4 x 3 x 5 x 5 = 300 outcomes possible. 


EVENTS, SIMPLE EVENTS, AND COMPOUND EVENTS 


4.5 For the sample space given in Fig. 4-8, give the outcomes associated with the following events
and classify each as a simple event or a compound event.

(a) At least one of the children is a girl.
(b) All the children are of the same sex.
(c) None of the children are boys.
(d) All of the children are boys.


Ans. The event that at least one of the children is a girl means that either one of the three was a girl, or
two of the three were girls, or all three were girls. The event that all were of the same sex means
that all three were boys or all three were girls. The event that none were boys means that all three
were girls. The outcomes for these events are as follows:

(a) mmf, mfm, mff, fmm, fmf, ffm, fff; compound event
(b) mmm, fff; compound event
(c) fff; simple event
(d) mmm; simple event


4.6 In the game of Yahtzee, five dice are thrown simultaneously. How many outcomes are there for
this experiment? Give the outcomes that correspond to the event that the same number
appeared on all five dice.


Ans. By the counting rule, there are 6 x 6 x 6 x 6 x 6 = 7,776 outcomes possible. Six of these 7,776
outcomes correspond to the event that the same number appeared on all five dice. These six
outcomes are as follows: (1, 1, 1, 1, 1), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3), (4, 4, 4, 4, 4), (5, 5, 5, 5, 5),
(6, 6, 6, 6, 6).


PROBABILITY

4.7 Which of the following are permissible values for the probability of the event E?
(a) P(E) = .75 (b) P(E) = -.25 (c) P(E) = 1.50
(d) P(E) = 1 (e) P(E) = .01


Ans. The probabilities given in (a), (d), and (e) are permissible since they are between 0 and 1
inclusive. The probability in (b) is not permissible, since probability measure can never be
negative. The probability in (c) is not permissible, since probability measure can never exceed one.


4.8 An experiment is made up of five simple events designated A, B, C, D, and E. Given that 
P(A) =.1, P(B) = .2, P(C) = .3, and P(E) = .2, find P(D). 


Ans. The sum of the probabilities for all simple events in an experiment must equal one. This implies
that .1 + .2 + .3 + P(D) + .2 = 1, and solving for P(D), we find that P(D) = .2. Note that this
experiment does not have equally likely outcomes.


CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY 
DEFINITIONS 


4.9 A container has 5 red balls, 10 white balls, and 35 blue balls. One of the balls is selected 
randomly. Find the probability that the selected ball is (a) red; (b) white; (c) blue.


Ans. This experiment has 50 equally likely outcomes. The event that the ball is red consists of five
outcomes, the event that the ball is white consists of 10 outcomes, and the event that the ball is
blue consists of 35 outcomes. Using the classical definition of probability, the following
probabilities are obtained: (a) 5/50 = .10; (b) 10/50 = .20; (c) 35/50 = .70.


4.10 A store manager notes that for 250 randomly selected customers, 75 use coupons in their 
purchase. What definition of probability should the manager use to compute the probability 
that a customer will use coupons in their store purchase? What probability should be assigned 
to this event? 


Ans. The relative frequency definition of probability should be used. The probability of using coupons
in store purchases is approximately 75/250 = .30.


4.11 Statements such as “the probability of snow tonight is 70%,” “the probability that it will rain
today is 20%,” and “the probability that a new computer software package will be successful is 
99%” are examples of what type of probability assignment? 


Ans. Since all three of the statements are based on professional judgment and experience, they are 
subjective probability assignments. 


MARGINAL AND CONDITIONAL PROBABILITIES 


4.12 Financial Planning Consultants Inc. keeps track of 500 stocks. Table 4.5 classifies the stocks 
according to two criteria. Three hundred are from the New York exchange and 200 are from 
the American exchange. Two hundred are up, 100 are unchanged, and 200 are down. 


Table 4.5

            Up    Unchanged    Down    Total
NYSE        --        75        --      300
AMEX        --        25        --      200
Total      200       100       200      500

If one of these stocks is randomly selected find the following. 

(a) The joint probability that the selected stock is from AMEX and unchanged. 

(b) The marginal probability that the selected stock is from NYSE. 

(c) The conditional probability that the stock is unchanged given that it is from AMEX. 


Ans. (a) There are 25 from the AMEX and unchanged. The joint probability is 25/500 = .05.
(b) There are 300 from NYSE. The marginal probability is 300/500 = .60.
(c) There are 200 from the AMEX. Of these 200, 25 are unchanged. The conditional probability
is 25/200 = .125.


4.13 Twenty percent of a particular age group has hypertension. Five percent of this age group has
hypertension and diabetes. Given that an individual from this age group has hypertension, what
is the probability that the individual also has diabetes?


Ans. Let A be the event that an individual from this age group has hypertension and let B be the event
that an individual from this age group has diabetes. We are given that P(A) = .20 and P(A and B) =
.05. We are asked to find P(B | A). P(B | A) = P(A and B)/P(A) = .05/.20 = .25.
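
The two conditional probabilities above follow the same pattern: a joint probability divided by the probability of the conditioning event. The sketch below illustrates that pattern; the function name conditional_probability and the variable names are hypothetical, and the inputs are the figures quoted in the problems.

```python
def conditional_probability(p_joint, p_given):
    """Return P(B | A) = P(A and B) / P(A)."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given

# Stock problem, part (c): 25 of the 500 stocks are AMEX and unchanged,
# and 200 of the 500 stocks are from AMEX.
p_unchanged_given_amex = conditional_probability(25 / 500, 200 / 500)

# Hypertension problem: P(A) = .20 and P(A and B) = .05.
p_diabetes_given_hypertension = conditional_probability(0.05, 0.20)

print(p_unchanged_given_amex, p_diabetes_given_hypertension)
```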


MUTUALLY EXCLUSIVE EVENTS 


4.14 For the experiment of drawing a card from a standard deck of 52, the following events are
defined: A is the event that the card is a face card, B is the event that the card is an Ace, C is
the event that the card is a heart, and D is the event that the card is black. List the six pairs of
events and determine which are mutually exclusive.


Ans.

Pair     Mutually exclusive     Common outcomes
A, B     Yes                    None
A, C     No                     Jack, queen, and king of hearts
A, D     No                     Jack, queen, and king of spades and clubs
B, C     No                     Ace of hearts
B, D     No                     Ace of clubs and ace of spades
C, D     Yes                    None


4.15 Three items are selected from a production process and each is classified as defective or
nondefective. Give the outcomes in the following events and check each pair to see if the pair is
mutually exclusive. Event A is the event that the first item is defective, B is the event that there
is exactly one defective in the three, and C is the event that all three items are defective.


Ans. A consists of the outcomes DDD, DDN, DND, and DNN. B consists of the outcomes DNN, NDN,
and NND. C consists of the outcome DDD. (D represents a defective, and N represents a
nondefective.)

Pair     Mutually exclusive     Common outcomes
A, B     No                     DNN
A, C     No                     DDD
B, C     Yes                    None
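
Checking pairs of events for mutual exclusivity amounts to checking whether they share any outcomes. A minimal sketch using Python sets for the three-item inspection experiment follows; the dictionary events and the helper common_outcomes are illustrative names only.

```python
from itertools import product

# Sample space for three items: D = defective, N = nondefective.
sample_space = {"".join(o) for o in product("DN", repeat=3)}

events = {
    "A": {o for o in sample_space if o[0] == "D"},        # first item defective
    "B": {o for o in sample_space if o.count("D") == 1},  # exactly one defective
    "C": {o for o in sample_space if o.count("D") == 3},  # all three defective
}

def common_outcomes(name1, name2):
    """Outcomes shared by two events; an empty set means mutually exclusive."""
    return events[name1] & events[name2]

for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    shared = common_outcomes(*pair)
    print(pair, "mutually exclusive" if not shared else sorted(shared))
```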


DEPENDENT AND INDEPENDENT EVENTS 


4.16 African-American males have a higher rate of hypertension than the general population. Let A 
represent the event that an individual is hypertensive and let B represent the event that an 
individual is an African-American male. Are A and B independent or dependent events? 


Ans. To say that African-American males have a higher rate of hypertension than the general population
means that P(A | B) > P(A). Since P(A | B) ≠ P(A), A and B are dependent events.


4.17 Table 4.6 gives the number of defective and nondefective items in samples from two different 
machines. Is the event of a defective item being produced by the machines dependent upon 
which machine produced it? 


Table 4.6

              Number defective     Number nondefective
Machine 1            5                    195
Machine 2           15                    585


Ans. Let D be the event that a defective item is produced by the machines. Let M₁ be the event that the
item is produced by machine 1, and let M₂ be the event that the item is produced by machine 2.
P(D) = 20/800 = .025, P(D | M₁) = 5/200 = .025, and P(D | M₂) = 15/600 = .025, and the event of producing
a defective item is independent of which machine produces it.
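
The independence check in problem 4.17 compares the overall defective rate with the rate for each machine. The sketch below does that with exact fractions, assuming the counts in Table 4.6 are 5 and 195 for machine 1 and 15 and 585 for machine 2; the variable names are illustrative.

```python
from fractions import Fraction

# (defective, nondefective) counts for each machine.
counts = {"machine 1": (5, 195), "machine 2": (15, 585)}

total_defective = sum(d for d, _ in counts.values())
total_items = sum(d + n for d, n in counts.values())
p_defective = Fraction(total_defective, total_items)  # P(D)

# P(D | machine) for each machine, kept exact with Fraction.
p_defective_given = {m: Fraction(d, d + n) for m, (d, n) in counts.items()}

# D is independent of the machine when every conditional rate equals P(D).
independent = all(p == p_defective for p in p_defective_given.values())
print(p_defective, p_defective_given, independent)
```

Exact fractions avoid the rounding questions that arise when comparing decimal approximations for equality.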


COMPLEMENTARY EVENTS 


4.18 The probability that a machine does not produce a defective item during a particular shift is 
.90. What is the complement of the event that a machine does not produce a defective item 
during that particular shift and what is the probability of that complementary event? 


Ans. The complementary event is that the machine produces at least one defective item during the shift,
and the probability that the machine produces at least one defective item during the shift is
1 - .90 = .10.


4.19 Events E₁, E₂, and E₃ have the following probabilities of occurrence: P(E₁) = .05, P(E₂) = .50,
and P(E₃) = .99. Find the probabilities of the complements of these events.


Ans. P(E₁ᶜ) = 1 - P(E₁) = 1 - .05 = .95    P(E₂ᶜ) = 1 - P(E₂) = 1 - .50 = .50
P(E₃ᶜ) = 1 - P(E₃) = 1 - .99 = .01


MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS 


4.20 If one card is drawn from a standard deck, what is the probability that the card is a face card? If 
two cards are drawn, without replacement, what is the probability that both are face cards? If 
five cards are drawn, without replacement, what is the probability that all five are face cards? 


Ans. Let E₁ be the event that the first card is a face card, let E₂ be the event that the second card drawn
is a face card, and so on until E₅ represents the event that the fifth card is a face card. The
probability that the first card is a face card is P(E₁) = 12/52 = .231. The event that both are face cards
is the event E₁ and E₂, and P(E₁ and E₂) = P(E₁) P(E₂ | E₁) = 12/52 x 11/51 = 132/2,652 = .050. The event that
all five are face cards is the event E₁ and E₂ and E₃ and E₄ and E₅. The probability of this event is
given by P(E₁) P(E₂ | E₁) P(E₃ | E₁, E₂) P(E₄ | E₁, E₂, E₃) P(E₅ | E₁, E₂, E₃, E₄), or

12/52 x 11/51 x 10/50 x 9/49 x 8/48 = 95,040/311,875,200 = .000305.


4.21 If 60 percent of all Americans own a handgun, find the probability that all five in a sample of
five randomly selected Americans own a handgun. Find the probability that none of the five
own a handgun.


Ans. Let E₁ be the event that the first individual owns a handgun, E₂ be the event that the second
individual owns a handgun, E₃ be the event that the third individual owns a handgun, E₄ be the
event that the fourth individual owns a handgun, and E₅ be the event that the fifth individual owns
a handgun. The probability that all five own a handgun is P(E₁ and E₂ and E₃ and E₄ and E₅).
Because of the large group from which the individuals are selected, the events E₁ through E₅ may be
treated as independent and the probability is given by P(E₁) P(E₂) P(E₃) P(E₄) P(E₅) = (.6)⁵ = .078. Similarly,
the probability that none of the five own a handgun is (.4)⁵ = .010.
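
The two preceding problems use the multiplication rule in its dependent form (cards drawn without replacement) and its independent form (a sample from a very large population). A short sketch of both follows; the function names prob_all_face_cards and prob_all_independent are hypothetical, chosen for this illustration.

```python
from fractions import Fraction

def prob_all_face_cards(k, face_cards=12, deck_size=52):
    """P(the first k cards drawn without replacement are all face cards)."""
    prob = Fraction(1)
    for i in range(k):
        # Conditional probability of a face card given that i face cards
        # have already been removed from the deck.
        prob *= Fraction(face_cards - i, deck_size - i)
    return prob

def prob_all_independent(p, n):
    """P(all n independent events occur) when each has probability p."""
    return p ** n

print(float(prob_all_face_cards(1)))
print(float(prob_all_face_cards(2)))
print(float(prob_all_face_cards(5)))
print(prob_all_independent(0.6, 5), prob_all_independent(0.4, 5))
```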


ADDITION RULE FOR THE UNION OF EVENTS 


4.22 Table 4.7 gives the IQ rating as well as the creativity rating of 250 individuals in a
psychological study. Find the probability that a randomly selected individual from this study
will be classified as having a high IQ or as having high creativity.


Table 4.7

                    Low IQ     High IQ
Low creativity        75          30
High creativity       20         125


Ans. Let A be the event that the selected individual has a high IQ, and let B be the event that the
individual has high creativity. Then P(A) = 155/250 = .62, P(B) = 145/250 = .58, P(A and B) = 125/250 = .50,
and P(A or B) = P(A) + P(B) - P(A and B) = .62 + .58 - .50 = .70.


4.23 The probability of event A is .25, the probability of event B is .10, and A and B are
independent events. What is the probability of the event A or B?


Ans. Since A and B are independent events, P(A and B) = P(A) P(B) = .25 x .10 = .025. P(A or B) = 
P(A) + P(B) — P(A and B) = .25 + .10 — .025 = .325. 
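
Both addition-rule problems can be reproduced in a few lines. This is a sketch only, assuming the Table 4.7 counts are 75 and 30 for low creativity and 20 and 125 for high creativity; prob_union and the other names are illustrative.

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# IQ and creativity study, 250 individuals in all.
n = 250
p_high_iq = Fraction(30 + 125, n)          # high IQ column total
p_high_creativity = Fraction(20 + 125, n)  # high creativity row total
p_both = Fraction(125, n)
p_iq_or_creativity = prob_union(p_high_iq, p_high_creativity, p_both)

# Independent events: P(A and B) = P(A) P(B).
p_a, p_b = Fraction(25, 100), Fraction(10, 100)
p_a_or_b = prob_union(p_a, p_b, p_a * p_b)

print(float(p_iq_or_creativity), float(p_a_or_b))
```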


BAYES’ THEOREM 


4.24 Box 1 contains 30 red and 70 white balls, box 2 contains 50 red and 50 white balls, and box 3
contains 75 red and 25 white balls. The three boxes are all emptied into a large box, and a ball
is selected at random. If the selected ball is red, what is the probability that it came from (a)
box 1; (b) box 2; (c) box 3?


Ans. Let B₁ be the event that the selected ball came from box 1, let B₂ be the event that the selected ball
came from box 2, let B₃ be the event that the ball came from box 3, and let R be the event that the
selected ball is red.


(a) We are asked to find P(B₁ | R). The three-step procedure is as follows:
Step 1: P(R and B₁) = P(R | B₁) P(B₁) = 30/100 x 1/3 = .10
Step 2: P(R) = P(R | B₁) P(B₁) + P(R | B₂) P(B₂) + P(R | B₃) P(B₃)
        P(R) = 155/300 = .517
Step 3: P(B₁ | R) = .10/.517 = .19


(b) We are asked to find P(B₂ | R).
Step 1: P(R and B₂) = P(R | B₂) P(B₂) = 50/100 x 1/3 = .167
Step 2: P(R) = .517
Step 3: P(B₂ | R) = .167/.517 = .323


(c) We are asked to find P(B₃ | R).
Step 1: P(R and B₃) = P(R | B₃) P(B₃) = 75/100 x 1/3 = .250
Step 2: P(R) = .517
Step 3: P(B₃ | R) = .250/.517 = .484


4.25 Table 4.8 gives the percentage of the U.S. population in four regions of the United States, as 
well as the percentage of social security recipients within each region. For the population of all 
social security recipients, what percent live in each of the four regions? 


Table 4.8

Region        Percentage of U.S.     Percentage of social security
              population             recipients in the region
Northeast     20                     15
Midwest       25                     10
South         35                     12
West          20                     11


Ans. Let B₁ be the event that an individual lives in the Northeast region, B₂ be the event that an
individual lives in the Midwest region, B₃ be the event that an individual lives in the South, and B₄
be the event that an individual lives in the West. Let S be the event that an individual is a social
security recipient. We are given that P(B₁) = .20, P(B₂) = .25, P(B₃) = .35, P(B₄) = .20, P(S | B₁) =
.15, P(S | B₂) = .10, P(S | B₃) = .12, and P(S | B₄) = .11. We are to find P(B₁ | S), P(B₂ | S),
P(B₃ | S), and P(B₄ | S). P(S) is needed to find each of the four probabilities.


P(S) = P(S | B₁) P(B₁) + P(S | B₂) P(B₂) + P(S | B₃) P(B₃) + P(S | B₄) P(B₄)
P(S) = .15 x .20 + .10 x .25 + .12 x .35 + .11 x .20 = .119
This means that 11.9% of the population are social security recipients.


The three-step procedure to find P(B₁ | S) is as follows:
Step 1: P(B₁ and S) = P(S | B₁) P(B₁) = .15 x .20 = .03
Step 2: P(S) = .119
Step 3: P(B₁ | S) = .03/.119 = .252


The three-step procedure to find P(B₂ | S) is as follows:
Step 1: P(B₂ and S) = P(S | B₂) P(B₂) = .10 x .25 = .025
Step 2: P(S) = .119
Step 3: P(B₂ | S) = .025/.119 = .210


The three-step procedure to find P(B₃ | S) is as follows:
Step 1: P(B₃ and S) = P(S | B₃) P(B₃) = .12 x .35 = .042
Step 2: P(S) = .119
Step 3: P(B₃ | S) = .042/.119 = .353


The three-step procedure to find P(B₄ | S) is as follows:
Step 1: P(B₄ and S) = P(S | B₄) P(B₄) = .11 x .20 = .022
Step 2: P(S) = .119
Step 3: P(B₄ | S) = .022/.119 = .185


We can conclude that 25.2% of the social security recipients are from the Northeast, 21.0% are 
from the Midwest, 35.3% are from the South, and 18.5% are from the West. 
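
The three-step procedure used in problems 4.24 and 4.25 is the same computation applied to different inputs, so it can be written once. The following is a sketch under that reading; bayes_posteriors is a hypothetical helper name, and the prior of 1/3 per box assumes, as in problem 4.24, that each box contributes 100 of the 300 balls.

```python
def bayes_posteriors(priors, likelihoods):
    """Return (total probability of the evidence, list of posteriors)."""
    # Step 1: joint probabilities, P(evidence and B_i) = P(evidence | B_i) P(B_i).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Step 2: the probability of the evidence is the sum of the joints.
    total = sum(joint)
    # Step 3: each posterior is a joint probability divided by that sum.
    return total, [j / total for j in joint]

# Problem 4.24: three boxes, evidence = "the selected ball is red".
p_red, box_posteriors = bayes_posteriors([1 / 3, 1 / 3, 1 / 3], [0.30, 0.50, 0.75])

# Problem 4.25: four regions, evidence = "is a social security recipient".
p_recipient, region_posteriors = bayes_posteriors(
    [0.20, 0.25, 0.35, 0.20], [0.15, 0.10, 0.12, 0.11]
)

print(round(p_red, 3), [round(p, 3) for p in box_posteriors])
print(round(p_recipient, 3), [round(p, 3) for p in region_posteriors])
```

Because the code carries full precision through step 3, the first box posterior prints to three decimals rather than the two decimals quoted in the worked answer.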


PERMUTATIONS AND COMBINATIONS 
4.26 Evaluate the following: (a) C₀ⁿ; (b) Cₙⁿ; (c) P₀ⁿ; (d) Pₙⁿ.


Ans. Each of the four parts uses the fact that 0! = 1.

(a) C₀ⁿ = n!/(0!(n - 0)!) = n!/(1 x n!) = 1, since the n! in the numerator and denominator divide out.
(b) Cₙⁿ = n!/(n!(n - n)!) = n!/(n! 0!) = 1, since 0! is equal to one and the n! divides out of top and bottom.
(c) P₀ⁿ = n!/(n - 0)! = n!/n! = 1
(d) Pₙⁿ = n!/(n - n)! = n!/0! = n!


4.27 An exacta wager at the racetrack is a bet where the bettor picks the horses that finish first and
second. A trifecta wager is a bet where the bettor picks the three horses that finish first,
second, and third. (a) In a 12-horse race, how many exactas are possible? (b) In a 12-horse
race, how many trifectas are possible?


Ans. Since the finish order of the horses is important, we use permutations to count the number of
possible selections.

(a) The number of ordered ways you can select two horses from twelve is

P₂¹² = 12!/(12 - 2)! = 12!/10! = (12 x 11 x 10!)/10! = 12 x 11 = 132

(b) The number of ordered ways you can select three horses from twelve is

P₃¹² = 12!/(12 - 3)! = 12!/9! = (12 x 11 x 10 x 9!)/9! = 12 x 11 x 10 = 1,320


4.28 A committee of five senators is to be selected from the U.S. Senate. How many different 
committees are possible? 


Ans. Since the order of the five senators is not important, the proper counting technique is
combinations.

C₅¹⁰⁰ = 100!/(5!(100 - 5)!) = (100 x 99 x 98 x 97 x 96 x 95!)/(120 x 95!) = (100 x 99 x 98 x 97 x 96)/120 = 75,287,520.

There are 75,287,520 different committees possible.
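
The permutation and combination counts in problems 4.27 and 4.28 can be verified with the Python standard library. The sketch assumes Python 3.8 or later, where math.perm and math.comb are available; the two helper functions are illustrative and simply restate the factorial formulas.

```python
import math

def permutations_count(n, r):
    """Number of ordered selections of r items from n: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)

def combinations_count(n, r):
    """Number of unordered selections of r items from n: n! / (r! (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

exactas = math.perm(12, 2)       # first and second place among 12 horses
trifectas = math.perm(12, 3)     # first, second, and third place
committees = math.comb(100, 5)   # five senators chosen from 100

print(exactas, trifectas, committees)
```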


USING PERMUTATIONS AND COMBINATIONS TO SOLVE 
PROBABILITY PROBLEMS 


4.29 Twelve individuals are to be selected to serve on a jury from a group consisting of 10 females
and 15 males. If the selection is done in a random fashion, what is the probability that all 12
are males?


Ans. Twelve individuals can be selected from 25 in the following number of ways:

C₁₂²⁵ = 25!/(12! 13!) = (25 x 24 x 23 x 22 x 21 x 20 x 19 x 18 x 17 x 16 x 15 x 14 x 13!)/(12! 13!)

After dividing out the common factor, 13!, we obtain the following.

C₁₂²⁵ = (25 x 24 x 23 x 22 x 21 x 20 x 19 x 18 x 17 x 16 x 15 x 14)/(12 x 11 x 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1) = 5,200,300

The above fraction may need to be evaluated in a zigzag fashion. That is, rather than multiply the
12 terms on top and then the 12 terms on bottom and then divide, do a multiplication, followed by
a division, followed by a multiplication, and so on until all terms on top and bottom are accounted
for. The jury can consist of all males in C₁₂¹⁵ = 15!/(12! 3!) = (15 x 14 x 13)/6 = 455
ways. The probability of an all-male jury is 455/5,200,300 = .000087. That is, there are about 9 chances
out of 100,000 that an all-male jury would be chosen at random.
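
The zigzag evaluation can be sketched in code. One assumption goes beyond the text: the factors are taken in an order (smallest numerator term first, divisors 1, 2, 3, ...) that keeps every intermediate result a whole number, which the text does not specify. The name combinations_zigzag is hypothetical.

```python
def combinations_zigzag(n, r):
    """Count the ways to choose r items from n by alternating multiply and divide."""
    result = 1
    for i in range(1, r + 1):
        # Multiply by the next numerator term, then divide by the next
        # denominator term. After step i the running value is itself a
        # combination count (i items chosen from n - r + i), so the
        # integer division is exact.
        result = result * (n - r + i) // i
    return result

ways_any_jury = combinations_zigzag(25, 12)
ways_all_male = combinations_zigzag(15, 12)
p_all_male = ways_all_male / ways_any_jury

print(ways_any_jury, ways_all_male, p_all_male)
```

Alternating the operations keeps the running value small, which matters on a calculator with limited digits; Python integers do not overflow, so here the ordering is a matter of illustration rather than necessity.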


4.30 The five teams in the western division of the American conference of the National Football
League are: Kansas City, Oakland, Denver, San Diego, and Seattle. Suppose the five teams are
equally balanced. (a) What is the probability that Kansas City, Seattle, and Denver finish the
season in first, second, and third place respectively? (b) What is the probability that the top
three finishers are Kansas City, Seattle, and Denver?


Ans. (a) Since the order of finish is specified, permutations are used to solve the problem. There are
P₃⁵ = 5!/(5 - 3)! = 5!/2! = 60 different ordered ways that three of the five teams could finish the season in first,
second, and third place in the conference. The probability that Kansas City will finish first, Seattle
will finish second, and Denver will finish third is 1/60 = .017.
(b) Since the order of finish for the three teams is not specified, combinations are used to solve
the problem. There are C₃⁵ = 5!/(3! 2!) = 10 combinations of three teams that could finish in the top
three. The probability that the top three finishers are Kansas City, Seattle, and Denver is 1/10 = .10.
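
The difference between parts (a) and (b) of problem 4.30 is whether order matters, and a brute-force listing makes that visible. A small sketch, assuming as the problem does that all finishing orders are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, permutations

teams = ["Kansas City", "Oakland", "Denver", "San Diego", "Seattle"]

# (a) Ordered top-three finishes: first, second, third.
ordered_top3 = list(permutations(teams, 3))
target_order = ("Kansas City", "Seattle", "Denver")
p_exact_order = Fraction(ordered_top3.count(target_order), len(ordered_top3))

# (b) Unordered top-three groups.
unordered_top3 = [frozenset(c) for c in combinations(teams, 3)]
target_group = frozenset(target_order)
p_any_order = Fraction(unordered_top3.count(target_group), len(unordered_top3))

print(len(ordered_top3), p_exact_order)
print(len(unordered_top3), p_any_order)
```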


Supplementary Problems 


EXPERIMENT, OUTCOMES, AND SAMPLE SPACE 


4.31 An experiment consists of using a 25-question test instrument to classify an individual as having either a
type A or a type B personality. Give the sample space for this experiment. Suppose two individuals are
classified as to personality type. Give the sample space. Give the sample space for three individuals.


Ans. For one individual, S = {A, B}, where A means the individual has a type A personality, and B
means the individual has a type B personality.
For two individuals, S = {AA, AB, BA, BB}, where AB, for example, is the outcome that the first
individual has a type A personality and the second individual has a type B personality.
For three individuals, S = {AAA, AAB, ABA, ABB, BAA, BAB, BBA, BBB}, where ABA is the
outcome that the first individual has a type A personality, the second has a type B personality, and
the third has a type A.


4.32 At a roadblock, state troopers classify drivers as either driving while intoxicated, driving while impaired,
or sober. Give the sample space for the classification of one driver. Give the sample space for two
drivers. How many outcomes are possible for three drivers?


Ans. Let A be the event that a driver is classified as driving while intoxicated, let B be the event that a
driver is classified as driving while impaired, and let C be the event that a driver is classified as
sober.


The sample space for one driver is S = {A, B, C}.
The sample space for two drivers is S = {AA, AB, AC, BA, BB, BC, CA, CB, CC}.


The sample space for three drivers has 27 possible outcomes. 


TREE DIAGRAMS AND THE COUNTING RULE 


4.33 An experiment consists of inspecting four items selected from a production line and classifying each one
as defective, D, or nondefective, N. How many branches would a tree diagram for this experiment have?
Give the branches that have exactly one defective. Give the branches that have exactly one nondefective.


Ans. The tree would have 2⁴ = 16 branches which would represent the possible outcomes for the
experiment.
The branches that have exactly one defective are DNNN, NDNN, NNDN, and NNND.
The branches that have exactly one nondefective are NDDD, DNDD, DDND, and DDDN.


4.34 An experiment consists of selecting one card from a standard deck, tossing a pair of dice, and then
flipping a coin. How many outcomes are possible for this experiment?


Ans. According to the counting rule, there are 52 x 36 x 2 = 3,744 possible outcomes. 


EVENTS, SIMPLE EVENTS, AND COMPOUND EVENTS
4.35 An experiment consists of rolling a single die. What are the simple events for this experiment? 


Ans. The simple events are the outcomes {1}, {2}, {3}, {4}, {5}, and {6}, where the number in braces
represents the number on the turned up face after the die is rolled.


4.36 Suppose we consider a baseball game between the New York Yankees and the Detroit Tigers as an 
experiment. Is the event that the Tigers beat the Yankees one to nothing a simple event or a compound 
event? Is the event that the Tigers shut out the Yankees a simple event or a compound event? (A shutout 
is a game in which one of the teams scores no runs.) 


Ans. The event that the Tigers shut out the Yankees one to nothing is a simple event because it 
represents a single outcome. The event that the Tigers shut out the Yankees is a compound event 
because it could be a one to nothing shutout, or a two to nothing shutout, or a three to nothing 
shutout, etc. 


PROBABILITY 


4.37 Which of the following are permissible values for the probability of the event E? 


(a) 3/4 (b) 3/2 (c) 0.0 (d) -5/7


Ans. (a) and (c) are permissible values, since they are between 0 and 1 inclusive. (b) is not permissible
because it exceeds one. (d) is not permissible because it is negative.


4.38 An experiment is made up of three simple events A, B, and C. If P(A) = x, P(B) = y, and P(C) = z, and
x + y + z = 1, can you be sure that a valid assignment of probabilities has been made?


Ans. No. Suppose x = .75, y = .75, and z = -.5, for example. Then x + y + z = 1, but this is not a valid
assignment of probabilities.


CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY DEFINITIONS 


4.39 If a U.S. senator is chosen at random, what is the probability that he/she is from one of the 48 contiguous
states?


Ans. There are 96 senators from the 48 contiguous states and a total of 100 from the 50 states. The
probability is 96/100 = .96.


4.40 In an actuarial study, 9,875 females out of 10,000 females who are age 20 live to be 30 years old. What is
the probability that a 20-year-old female will live to be 30 years old?


Ans. Using the relative frequency definition of probability, the probability is 9,875/10,000 = .9875.


4.41 Casino odds for sporting events such as football games, fights, etc. are examples of which probability 
definition? 


Ans. subjective definition of probability 


MARGINAL AND CONDITIONAL PROBABILITIES 


4.42 Table 4.9 gives the joint probability distribution for a group of individuals in a sociological study. 

Table 4.9

                           Welfare recipient
High school graduate        No        Yes
No                         .15        .25
Yes                        .55        .05

(a) Find the marginal probability that an individual in this study is a welfare recipient.
(b) Find the marginal probability that an individual in this study is not a high school graduate.
(c) If an individual is a welfare recipient, what is the probability that he/she is not a high school
graduate?
(d) If an individual is a high school graduate, what is the probability that he/she is a welfare recipient?


Ans. (a) .25 + .05 = .30   (b) .15 + .25 = .40   (c) .25/.30 = .83   (d) .05/.60 = .083


4.43 Sixty percent of the registered voters in Douglas County are Republicans. Fifteen percent of the
registered voters in Douglas County are Republican and have incomes above $250,000 per year. What
percent of the Republicans who are registered voters in Douglas County have incomes above $250,000
per year?


Ans. .15/.60 = .25. Twenty-five percent of the Republicans have incomes above $250,000.


MUTUALLY EXCLUSIVE EVENTS 


4.44 With reference to the sociological study described in problem 4.42, are the events {high school graduate}
and {welfare recipient} mutually exclusive?


Ans. No, since 5% of the individuals in the study satisfy both events.


4.45 Do mutually exclusive events cover all the possibilities in an experiment?


Ans. No. This is true only when the mutually exclusive events are also complementary.


DEPENDENT AND INDEPENDENT EVENTS 


4.46 If two events are mutually exclusive, are they dependent or independent?


Ans. If A and B are two nontrivial events (that is, they have a nonzero probability) and if they are
mutually exclusive, then P(A | B) = 0, since if B occurs, then A cannot occur. But since P(A) is
positive, P(A | B) ≠ P(A), and the events must be dependent.


4.47 Seventy-five percent of all Americans live in a metropolitan area. Eighty percent of all Americans
consider themselves happy. Sixty percent of all Americans live in a metropolitan area and consider
themselves happy. Are the events {lives in a metropolitan area} and {considers themselves happy}
independent or dependent events?


Ans. Let A be the event {live in a metropolitan area} and let B be the event {consider themselves
happy}. P(A | B) = P(A and B)/P(B) = .60/.80 = .75 = P(A), and hence A and B are independent events.


COMPLEMENTARY EVENTS 


4.48 What is the sum of the probabilities of two complementary events?


Ans. one


4.49 The probability that a machine used to make computer chips is out of control is .001. What is the
complement of the event that the machine is out of control and what is the probability of this event?


Ans. The complementary event is that the machine is in control. The probability of this event is .999.


MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS 


4.50 In a particular state, 20 percent of the residents are classified as senior citizens. Sixty percent of the
senior citizens of this state are receiving social security payments. What percent of the residents are
senior citizens who are receiving social security payments?


Ans. Let A be the event that a resident is a senior citizen and let B be the event that a resident is
receiving social security payments. We are given that P(A) = .20 and P(B | A) = .60. P(A and B) =
P(A) P(B | A) = .20 x .60 = .12. Twelve percent of the residents are senior citizens who are
receiving social security payments.


4.51 If E₁, E₂, ..., Eₙ are n independent events, then P(E₁ and E₂ and E₃ ... and Eₙ) is equal to the product
of the probabilities of the n events. Use this probability rule to answer the following.
(a) Find the probability of tossing five heads in a row with a coin.
(b) Find the probability that the face 6 turns up every time in four rolls of a die.
(c) If 43 percent of the population approve of the president's performance, what is the probability that
all 10 individuals in a telephone poll disapprove of his performance?


Ans. (a) (.5)⁵ = .03125
(b) (1/6)⁴ = .00077
(c) (.57)¹⁰ = .00362


ADDITION RULE FOR THE UNION OF EVENTS

4.52 Events A and B are mutually exclusive and P(A) = .25 and P(B) = .35. Find P(A and B) and P(A or B). 
Ans. Since A and B are mutually exclusive, P(A and B) = 0. Also P(A or B) = .25 + .35 = .60. 


4.53 Fifty percent of a particular market own a VCR or have a cell phone. Forty percent of this market own a
VCR. Thirty percent of this market have a cell phone. What percent own a VCR and have a cell phone? 


Ans. P(A and B) = P(A) + P(B) — P(A or B) = .3 + .4 - .5 = .2, and therefore 20% own a VCR and a 
cell phone. 


BAYES’ THEOREM 


4.54 In Arkansas, 30 percent of all cars emit excessive amounts of pollutants. The probability is 0.95 that a car
emitting excessive amounts of pollutants will fail the state’s vehicular emission test, and the probability is
0.15 that a car not emitting excessive amounts of pollutants will also fail the test. If a car fails the
emission test, what is the probability that it actually emits excessive amounts of emissions?


Ans. Let A be the event that a car emits excessive amounts of pollutants, and B the event that a car fails
the emission test. Then, P(A) = .30, P(Aᶜ) = .70, P(B | A) = .95, and P(B | Aᶜ) = .15. We are asked
to find P(A | B). The three-step procedure results in P(A | B) = .73.


4.55 In a particular community, 15 percent of all adults over 50 have hypertension. The health service in this
community correctly diagnoses 99 percent of all such persons with hypertension. The health service
incorrectly diagnoses 5 percent who do not have hypertension as having hypertension.
(a) Find the probability that the health service will diagnose an adult over 50 as having hypertension.
(b) Find the probability that an individual over 50 who is diagnosed as having hypertension actually has
hypertension.


Ans. Let A be the event that an individual over 50 in this community has hypertension, and let B be the
event that the health service diagnoses an individual over 50 as having hypertension. Then, P(A) =
.15, P(Aᶜ) = .85, P(B | A) = .99, and P(B | Aᶜ) = .05.
(a) P(B) = .15 x .99 + .85 x .05 = .19
(b) Using the three-step procedure, P(A | B) = .78


PERMUTATIONS AND COMBINATIONS 


4.56 How many ways can three letters be selected from the English alphabet if:
(a) The order of selection of the three letters is considered important, i.e., abc is different from cba, for
example.
(b) The order of selection of the three letters is not important?


Ans. (a) 15,600 (b) 2,600


4.57 The following teams comprise the Atlantic division of the Eastern conference of the National Hockey
League: Florida, Philadelphia, N.Y. Rangers, New Jersey, Washington, Tampa Bay, and N.Y. Islanders.
(a) Assuming no teams are tied at the end of the season, how many different final standings are possible
for the seven teams?
(b) Assuming no ties, how many different first-, second-, and third-place finishers are possible?


Ans. (a) 5,040 (b) 210


4.58 A criminologist selects five prison inmates from 30 volunteers for more intensive study. How many such
groups of five are possible when selected from the 30?


Ans. 142,506 


USING PERMUTATIONS AND COMBINATIONS TO SOLVE PROBABILITY PROBLEMS 


4.59 Three individuals are to be randomly selected from the 10 members of a club to serve as president, vice
president, and treasurer. What is the probability that Lana is selected for president, Larry for vice
president, and Johnny for treasurer?


Ans. 1/P₃¹⁰ = 1/720 = .00139


4.60 A sample of size 3 is selected from a box which contains two defective items and 18 nondefective items.
What is the probability that the sample contains one defective item?


Ans. (C₁² x C₂¹⁸)/C₃²⁰ = (2 x 153)/1,140 = 306/1,140 = .268


Chapter 5 


Discrete Random Variables 


RANDOM VARIABLE 


A random variable associates a numerical value with each outcome of an experiment. A random 
variable is defined mathematically as a real-valued function defined on a sample space, and is 
represented as a letter such as X or Y. 


EXAMPLE 5.1 For the experiment of flipping a coin twice, the random variable X is defined to be the number 
of tails to appear when the experiment is performed. The random variable Y is defined to be the number of 
heads minus the number of tails when the experiment is conducted. Table 5.1 shows the outcomes and the 
numerical value each random variable assigns to the outcome. These are two of the many random variables 
possible for this experiment. 


Table 5.1

Outcome     Value of X     Value of Y
HH          0              2
HT          1              0
TH          1              0
TT          2              -2
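
Because a random variable is a function on the sample space, Example 5.1 translates directly into two small functions. The sketch below is illustrative only; the function names X and Y mirror the random variables in the example, and table is a hypothetical name for the resulting lookup.

```python
from itertools import product

# Sample space for flipping a coin twice: HH, HT, TH, TT.
sample_space = ["".join(o) for o in product("HT", repeat=2)]

def X(outcome):
    """Number of tails in the outcome."""
    return outcome.count("T")

def Y(outcome):
    """Number of heads minus number of tails in the outcome."""
    return outcome.count("H") - outcome.count("T")

# Each outcome is mapped to the pair (value of X, value of Y).
table = {outcome: (X(outcome), Y(outcome)) for outcome in sample_space}
for outcome, values in table.items():
    print(outcome, values)
```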


EXAMPLE 5.2 An experimental study involving diabetics measured the following random variables: fasting 
blood sugar, hemoglobin, blood pressure, and triglycerides. These random variables assign numerical values to
each of the individuals in the study. The numerical values range over different intervals for the different random 
variables. 


DISCRETE RANDOM VARIABLE 


A random variable is a discrete random variable if it has either a finite number of values or 
infinitely many values that can be arranged in a sequence. We say that a discrete random variable 
may assume a countable number of values. Discrete random variables usually arise from an 
experiment that involves counting. The random variables given in Example 5.1 are discrete, since 
they have a finite number of different values. Both of the variables are associated with counting. 


EXAMPLE 5.3 An experiment consists of observing 100 individuals who get a flu shot and counting the 
number X who have a reaction. The variable X may assume 101 different values from 0 to 100. Another
experiment consists of counting the number of individuals W who get a flu shot until an individual gets a flu 
shot and has a reaction. The variable W may assume the values 1, 2, 3, .... The variable W can assume a 
countably infinite number of values. 


CONTINUOUS RANDOM VARIABLE 


A random variable is a continuous random variable if it is capable of assuming all the values in 
an interval or in several intervals. Because measuring devices have limited accuracy, no observed 
random variable is truly continuous. However, we may treat such random variables abstractly as being 
continuous. 


EXAMPLE 5.4 The following random variables are considered continuous random variables: survival time of 
cancer patients, the time between release from prison and conviction for another crime, the daily milk yield of 
Holstein cows, weight loss during a dietary routine, and the household incomes for single-parent households in a 
sociological study. 


PROBABILITY DISTRIBUTION 


The probability distribution of a discrete random variable X is a list or table of the distinct 
numerical values of X and the probabilities associated with those values. The probability distribution 
is usually given in tabular form or in the form of an equation. 


EXAMPLE 5.5 Table 5.2 lists the outcomes and the values of X, the sum of the up-turned faces for the 
experiment of rolling a pair of dice. Table 5.2 is used to build the probability distribution of the random variable 
X. This table lists the 36 possible outcomes. Only one outcome gives a value of 2 for X. The probability that 
X = 2 is 1 divided by 36 or .028 when rounded to three decimal places. We write this as P(2) = .028. The 
probability that X = 3, P(3), is equal to 2 divided by 36 or .056. The probability distribution for X is given in 
Table 5.3. 


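Table 5.2 
Value of X (the sum of the two faces) for each of the 36 outcomes 

                    Second die 
First die     1     2     3     4     5     6 
    1         2     3     4     5     6     7 
    2         3     4     5     6     7     8 
    3         4     5     6     7     8     9 
    4         5     6     7     8     9     10 
    5         6     7     8     9     10    11 
    6         7     8     9     10    11    12 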


Table 5.3 
x       2     3     4     5     6     7     8     9     10    11    12 
P(x)    .028  .056  .083  .111  .139  .167  .139  .111  .083  .056  .028 


The probability distribution, P(x) = P(X = x), satisfies formulas (5.1) and (5.2). 
$P(x) \ge 0$ for each value x of X (5.1) 
$\sum P(x) = 1$ where the sum is over all values of X (5.2) 
Notice that the values for P(x) in Table 5.3 are all positive, which satisfies formula (5.1), and 
that the sum equals 1 except for rounding errors (the rounded values total 1.001). 
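

The construction of the distribution of the sum of two dice, and the check of formulas (5.1) and (5.2), may be sketched in Python as follows. This is only an illustration under the assumption that the 36 outcomes are equally likely; the variable names are hypothetical, and exact fractions are used so that no rounding error appears in the total.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# P(x) = (number of outcomes with sum x) / 36, kept as exact fractions.
dist = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

for x, p in dist.items():
    print(x, p, round(float(p), 3))

# Formula (5.1): every probability is nonnegative.
all_nonnegative = all(p >= 0 for p in dist.values())
# Formula (5.2): the probabilities sum to 1 (exactly, with fractions).
total = sum(dist.values())
```

With fractions the total is exactly 1; the small excess seen when the three-decimal values are added comes only from rounding.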


EXAMPLE 5.6 $P(x) = \frac{x}{10}$, x = 1, 2, 3, 4 is a probability distribution since P(1) = .1, P(2) = .2, P(3) = .3, and 
P(4) = .4 and (5.1) and (5.2) are both satisfied. 


EXAMPLE 5.7 It is known from census data that, for a particular income group, 10% of households have 
no children, 25% have one child, 50% have two children, 10% have three children, and 5% have four children. 
If X represents the number of children per household for this income group, then the probability distribution of 
X is given in Table 5.4. 


Table 5.4 
x       0     1     2     3     4 
P(x)    .10   .25   .50   .10   .05 


The event X ≥ 2 is the event that a household in this income group has at least two children and means that 
X = 2, or X = 3, or X = 4. The probability that X ≥ 2 is given by 


$P(X \ge 2) = P(X = 2) + P(X = 3) + P(X = 4) = .50 + .10 + .05 = .65$ 


The event X ≤ 1 is the event that a household in this income group has at most one child and is equivalent 
to X = 0, or X = 1. The probability that X ≤ 1 is given by 


$P(X \le 1) = P(X = 0) + P(X = 1) = .10 + .25 = .35$ 


The event 1 ≤ X ≤ 3 is the event that a household has between one and three children inclusive and is 
equivalent to X = 1, or X = 2, or X = 3. The probability that 1 ≤ X ≤ 3 is given by 


$P(1 \le X \le 3) = P(X = 1) + P(X = 2) + P(X = 3) = .25 + .50 + .10 = .85$ 


The above discussion may be summarized by stating that 65% of the households have at least two children, 
35% have at most one child, and 85% have between one and three children inclusive. 
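

Event probabilities of this kind are sums of P(x) over the values of x that make up the event. The following Python sketch shows one way to express that, assuming the distribution of Example 5.7 is stored in a dictionary; the helper name prob is hypothetical.

```python
# Probability distribution of the number of children per household (Example 5.7).
dist = {0: 0.10, 1: 0.25, 2: 0.50, 3: 0.10, 4: 0.05}

def prob(event):
    """Sum P(x) over the values x for which event(x) is true."""
    return sum(p for x, p in dist.items() if event(x))

at_least_two = prob(lambda x: x >= 2)       # P(X >= 2)
at_most_one = prob(lambda x: x <= 1)        # P(X <= 1)
one_to_three = prob(lambda x: 1 <= x <= 3)  # P(1 <= X <= 3)

print(round(at_least_two, 2), round(at_most_one, 2), round(one_to_three, 2))
```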


MEAN OF A DISCRETE RANDOM VARIABLE 


The mean of a discrete random variable is given by 
$\mu = \sum x P(x)$ where the sum is over all values of X (5.3) 


The mean of a discrete random variable is also called the expected value, and is represented by E(X). 
The mean or expected value will also often be referred to as the population mean. Regardless of 
whether it is called the mean, the expected value, or the population mean, the numerical value is 
given by formula (5.3). 


EXAMPLE 5.8 The mean of the random variable in Example 5.5 is found as follows. 


x        2     3     4     5     6     7      8      9     10    11    12 
P(x)     .028  .056  .083  .111  .139  .167   .139   .111  .083  .056  .028 
xP(x)    .056  .168  .332  .555  .834  1.169  1.112  .999  .830  .616  .336 


The sum of the row labeled xP(x) is the mean of X and is equal to 7.007. If fractions are used in place of 
decimals for P(x), the value will equal 7 exactly. In other words, E(X) = 7. The long-term average value for the 
sum on the dice is 7. The mean value of the sum on the dice for the population of all possible rolls of the dice 
equals 7. If you were to record all the rolls of the dice at Las Vegas, the average value would equal 7. 
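

The remark about fractions versus decimals can be checked with a short Python sketch. The expression 6 - |x - 7| for the number of outcomes giving the sum x is a compact device assumed here to reproduce the distribution of Example 5.5; it is not a formula from the text.

```python
from fractions import Fraction

# P(x) for the sum of two dice: (6 - |x - 7|) of the 36 outcomes give the sum x.
dist = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

# Formula (5.3) with exact fractions.
mean_exact = sum(x * p for x, p in dist.items())

# The same sum with P(x) rounded to three decimals, as in the worked table.
mean_rounded = sum(x * round(float(p), 3) for x, p in dist.items())

print(mean_exact, round(mean_rounded, 3))
```

The exact computation gives 7, while the three-decimal probabilities give 7.007, so the discrepancy is purely a rounding effect.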


EXAMPLE 5.9 The mean number of children per household for the distribution given in Example 5.7 is found 
as follows. 


x        0     1     2      3     4 
P(x)     .10   .25   .50    .10   .05 
xP(x)    0     .25   1.00   .30   .20 


The sum of the row labeled xP(x), 1.75, is the mean of the distribution. For this population of households, the 
mean number of children per household is 1.75. Notice that the mean does not have to equal a value assumed by 
the random variable. 


STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE 


The variance of a discrete random variable is represented by σ² and is defined by 
$\sigma^2 = \sum (x - \mu)^2 P(x)$ (5.4) 
The variance is also represented by Var(X) and may be calculated by the alternative formula given by 
$\mathrm{Var}(X) = \sigma^2 = \sum x^2 P(x) - \mu^2$ (5.5) 


The standard deviation of a discrete random variable is represented by σ or sd(X) and is given by 
$\sigma = \mathrm{sd}(X) = \sqrt{\mathrm{Var}(X)}$ (5.6) 


EXAMPLE 5.10 A social researcher is interested in studying the family dynamics caused by gender makeup. 
The distribution of the number of girls in families consisting of four children is as follows: 6.25% of such 
families have no girls, 25% have one girl, 37.5% have two girls, 25% have three girls, and 6.25% have four 
girls. Table 5.5 illustrates the computation of the variance of X, the number of girls in a family having four 
children. The standard deviation is the square root of the variance and equals one. 


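Table 5.5 
x      P(x)     xP(x)    (x − μ)²    (x − μ)²P(x) 
0      .0625    0        4           .25 
1      .25      .25      1           .25 
2      .375     .75      0           0 
3      .25      .75      1           .25 
4      .0625    .25      4           .25 
Sum    1        μ = 2                σ² = 1 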


Table 5.6 illustrates the computation of the components needed when using the alternative formula (5.5) to 
compute the variance of X. Using formula (5.5), the variance is Var(X) = Σ x²P(x) − μ² = 5 − 4 = 1, and the 
standard deviation is also equal to one. We see that formulas (5.4) and (5.5) give the same results for the 
standard deviation. The mean number of girls in families of four children equals 2 and the standard deviation is 
equal to 1. 


Table 5.6 
x      P(x)      x²     x²P(x) 
0      0.0625    0      0 
1      0.25      1      0.25 
2      0.375     4      1.5 
3      0.25      9      2.25 
4      0.0625    16     1 
Sum    1                5 
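

The agreement between the defining formula (5.4) and the alternative formula (5.5) can be verified numerically. The following Python sketch uses the distribution of Example 5.10; the variable names are illustrative assumptions.

```python
from math import sqrt

# Distribution of the number of girls in four-child families (Example 5.10).
dist = {0: 0.0625, 1: 0.25, 2: 0.375, 3: 0.25, 4: 0.0625}

mean = sum(x * p for x, p in dist.items())                      # formula (5.3)
var_def = sum((x - mean) ** 2 * p for x, p in dist.items())     # formula (5.4)
var_alt = sum(x ** 2 * p for x, p in dist.items()) - mean ** 2  # formula (5.5)
sd = sqrt(var_def)                                              # formula (5.6)

print(mean, var_def, var_alt, sd)
```

Formula (5.5) needs only the sums of xP(x) and x²P(x), which is why it is usually the more convenient of the two for hand computation.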


BINOMIAL RANDOM VARIABLE 


A binomial random variable is a discrete random variable that is defined when the conditions of 
a binomial experiment are satisfied. The conditions of a binomial experiment are given in Table 5.7. 


Table 5.7 
Conditions of a Binomial Experiment 

1. There are n identical trials. 
2. Each trial has only two possible outcomes. 
3. The probabilities of the two outcomes remain constant for each trial. 
4. The trials are independent. 


The two outcomes possible on each trial are called success and failure. The probability 
associated with success is represented by the letter p and the probability associated with failure is 
represented by the letter q, and since one or the other of success or failure must occur on each trial, 
p + q must equal one, i.e., p + q = 1. When the conditions of the binomial experiment are satisfied, the 
binomial random variable X is defined to equal the number of successes to occur in the n trials. The 
random variable X may assume any one of the whole numbers from zero to n. 


EXAMPLE 5.11 A balanced coin is tossed 10 times, and the number of times a head occurs is represented by 
X. The conditions of a binomial experiment are satisfied. There are n = 10 identical trials. Each trial has two 
possible outcomes, head or tail. Since we are interested in the occurrence of a head on a trial, we equate the 
occurrence of a head with success, and the occurrence of a tail with failure. We see that p = .5 and q = .5. Also, 
it is clear that the trials are independent since the occurrence of a head on a given toss is independent of what 
occurred on previous tosses. The number of heads to occur in the 10 tosses, X, can equal any whole number 
from 0 to 10. X is a binomial random variable with n = 10 and p = .5. 


EXAMPLE 5.12 A balanced die is tossed five times, and the number of times that the face with six spots on it 
faces up is counted. The conditions of a binomial experiment are satisfied. There are five identical trials. Each 
trial has two possible outcomes since the face 6 turns up or a face other than 6 turns up. Since we are interested 
in the face 6, we equate the face 6 with success and any other face with failure. We see that p = 1/6 and q = 5/6. 
Also, the outcomes from toss to toss are independent of one another. The number of times the face 6 turns up, X, 
can equal 0, 1, 2, 3, 4, or 5. X is a binomial random variable with n = 5 and p = .167. 


EXAMPLE 5.13 A manufacturer uses an injection mold process to produce disposable razors. One-half of one 
percent of the razors are defective. That is, on average, 500 out of every 100,000 razors are defective. A quality 
control technician chooses a daily sample of 100 randomly selected razors and records the number of defectives 
found in the sample in order to monitor the process. The conditions of a binomial experiment are satisfied. 
There are 100 identical trials. Each trial has two possible outcomes since the razor is either defective or 
nondefective. Since we are recording the number of defectives, we equate the occurrence of a defective with success 
and the occurrence of a nondefective with failure. We see that p = .005 and q = .995. The number of defectives 
in the 100, X, can equal any whole number from 0 to 100. X is a binomial random variable with n = 100 
and p = .005. 


BINOMIAL PROBABILITY FORMULA 


The binomial probability formula is used to compute probabilities for binomial random 
variables. The binomial probability formula is given in formula (5.7). 


$P(x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$ for x = 0, 1, 2, ..., n (5.7) 


The symbol $\binom{n}{x}$ is discussed in Chapter 4 and represents the number of combinations possible when 
x items are selected from n. 


EXAMPLE 5.14 A die is rolled three times and the random variable X is defined to be the number of times 
the face 6 turns up in the three tosses. X is a binomial random variable that assumes one of the values 0, 1, 2, or 
3. In this binomial experiment, success occurs on a given trial if the face 6 turns up and failure occurs if any 
other face turns up. The probability of success is p = 1/6 = .167 and the probability of failure is q = 5/6 = .833. In 
order to help understand why the binomial probability formula works, the binomial probabilities will be 
computed using the basic principles of probability first and then by using formula (5.7). 

To find P(0), note that X = 0 means that no successes occurred, that is, three failures occurred. Because of 
the independence of trials, the probability of three failures is .833 × .833 × .833 = (.833)³ = .578. That is, the 
probability that the face 6 does not turn up on any of the three tosses is .578. 

To find P(1), note that X = 1 means that one success and two failures occurred. One success and two 
failures occur if the sequence SFF, or the sequence FSF, or the sequence FFS occurs. The probability of SFF is 
.167 × .833 × .833 = .1159. The probability of FSF is .833 × .167 × .833 = .1159. The probability of FFS is 
.833 × .833 × .167 = .1159. The three probabilities for SFF, FSF, and FFS are added because of the addition 
rule for mutually exclusive events. Therefore, P(X = 1) = P(1) = 3 × .1159 = .348. The probability that the face 
6 turns up on one of the three tosses is .348. 

To find P(2), note that X = 2 means that two successes and one failure occurred. Two successes and one 
failure occur if the sequence FSS or the sequence SFS or the sequence SSF occurs. The probability of FSS is 
.833 × .167 × .167 = .0232. The probability of SFS is .167 × .833 × .167 = .0232. The probability of SSF is 
.167 × .167 × .833 = .0232. The probabilities for FSS, SFS, and SSF are added because of the addition rule for 
mutually exclusive events. Therefore, P(X = 2) = P(2) = 3 × .0232 = .070. The probability that the face 6 turns 
up on two of the three tosses is .070. 

To find P(3), note that X = 3 means that three successes occurred. The probability of three consecutive 
successes is .167 × .167 × .167 = (.167)³ = .005. There are five chances in a thousand of the face 6 turning up on 
all three tosses. 

Using the binomial probability formula, we find the four probabilities as follows: 

$P(0) = \frac{3!}{0!\,3!}(.167)^0(.833)^3 = (.833)^3 = .578$ 

$P(1) = \frac{3!}{1!\,2!}(.167)^1(.833)^2 = 3 \times .167 \times (.833)^2 = .348$ 

$P(2) = \frac{3!}{2!\,1!}(.167)^2(.833)^1 = .070$ 

$P(3) = \frac{3!}{3!\,0!}(.167)^3(.833)^0 = .005$ 


This example illustrates how much work the binomial probability formula saves us when solving problems 
involving the binomial distribution. The distribution for the variable in Example 5.14 is given in Table 5.8. 


Table 5.8 
x       0      1      2      3 
P(x)    .578   .348   .070   .005 


In formula (5.7), the term $p^x q^{n-x}$ gives the probability of any one particular sequence of x successes and 
(n − x) failures. The term $\frac{n!}{x!\,(n-x)!}$ counts the number of different arrangements which are possible 
for x successes and (n − x) failures. The role of these terms is illustrated in Example 5.14. 


EXAMPLE 5.15 Fifty-seven percent of companies in the U.S. use networking to recruit workers. The 
probability that in a survey of ten companies exactly half of them use networking to recruit workers is 


$P(5) = \frac{10!}{5!\,5!}(.57)^5(.43)^5 = .223$ 
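

Formula (5.7) translates directly into a few lines of Python. The sketch below is an illustration only: the function name binomial_pmf is an assumption, and math.comb (available in Python 3.8 and later) supplies the number of combinations.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Formula (5.7): probability of exactly x successes in n trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Example 5.14: three rolls of a die, success = the face 6 turns up.
dist = [binomial_pmf(x, 3, 1 / 6) for x in range(4)]
print([round(prob, 3) for prob in dist])

# Example 5.15: exactly 5 of 10 companies use networking, p = .57.
p5 = binomial_pmf(5, 10, 0.57)
print(round(p5, 4))
```

Because the sketch uses p = 1/6 rather than the rounded value .167, the third decimal place for Example 5.14 can differ slightly from the hand computation (for instance, P(0) = 125/216, which rounds to .579 rather than .578).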


TABLES OF THE BINOMIAL DISTRIBUTION 


Appendix 1 contains the table of binomial probabilities. This table lists the probabilities of x for 
n = 1 to n = 25 for selected values of p. 


EXAMPLE 5.16 If X represents the number of girls in families having four children, then X is a binomial 
random variable with n = 4 and p = .5. Using formula (5.7), the distribution of X is determined as follows: 


$P(0) = \frac{4!}{0!\,4!}(.5)^0(.5)^4 = .0625$ 
$P(1) = \frac{4!}{1!\,3!}(.5)^1(.5)^3 = .25$ 
$P(2) = \frac{4!}{2!\,2!}(.5)^2(.5)^2 = .375$ 
$P(3) = \frac{4!}{3!\,1!}(.5)^3(.5)^1 = .25$ 
$P(4) = \frac{4!}{4!\,0!}(.5)^4(.5)^0 = .0625$ 


Table 5.9 contains a portion of the table of binomial probabilities found in Appendix 1. The column for n = 4 
and p = .50 is the portion of the table from which the binomial probability distribution for X is obtained. The 
probabilities given are the same ones obtained by using formula (5.7). 


Table 5.9 
                    p 
n    x    .40      .50      .60 
4    0    .1296    .0625    .0256 
     1    .3456    .2500    .1536 
     2    .3456    .3750    .3456 
     3    .1536    .2500    .3456 
     4    .0256    .0625    .1296 


EXAMPLE 5.17 Eighty percent of the residents in a large city feel that the government should allow more 
than one company to provide local telephone service. Using the table of binomial probabilities, the probability 
that at least five in a sample of ten residents feel that the government should allow more than one company to 
provide local telephone service is found as follows. The event “at least five” means five or more and is 
equivalent to X = 5 or X = 6 or X = 7 or X = 8 or X = 9 or X = 10. The probabilities are added because of the 
addition law for mutually exclusive events. 


P(X ≥ 5) = P(5) + P(6) + P(7) + P(8) + P(9) + P(10) 
P(X ≥ 5) = .0264 + .0881 + .2013 + .3020 + .2684 + .1074 = .9936 
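

The same "at least five" probability can be obtained without a table by summing formula (5.7) directly. The Python sketch below is a hedged illustration with assumed names; it also shows the equivalent route through the complementary event.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Formula (5.7): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.8

# P(X >= 5): add the probabilities of the mutually exclusive values 5, 6, ..., 10.
at_least_five = sum(binomial_pmf(x, n, p) for x in range(5, n + 1))

# Same answer through the complement: 1 - P(X <= 4).
via_complement = 1 - sum(binomial_pmf(x, n, p) for x in range(5))

print(round(at_least_five, 4), round(via_complement, 4))
```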


Statistical software is used to perform binomial probability computations and to some extent has 
rendered binomial probability tables obsolete. Minitab contains routines for computing binomial 
probabilities. Example 5.18 illustrates how to use Minitab to compute the probability for a single 
value or the total distribution for a binomial random variable. 


EXAMPLE 5.18 The following Minitab output shows the binomial probability computations given in 
Examples 5.15 and 5.16. The binomial probabilities are shown in bold type. 


MTB > # Minitab computation for Example 5.15 # 
MTB > pdf 5; 
SUBC > binomial n = 10 p = .57. 


Probability Density Function 
Binomial with n = 10 and p = 0.570000 


X P(X = x) 
5.00 0.2229 


MTB > # Minitab computation for Example 5.16 # 
MTB > pdf; 
SUBC > binomial n = 4 and p = .5. 


Probability Density Function 
Binomial with n = 4 and p = 0.500000 


x P(X = x) 
0 0.0625 
1 0.2500 
2 0.3750 
3 0.2500 
4 0.0625 


MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE 


The mean and variance for a binomial random variable may be found by using formulas (5.3) and 
(5.4). However, in the case of a binomial random variable, shortcut formulas exist for computing the 
mean and variance. The mean of a binomial random variable is given by 

$\mu = np$ (5.8) 


The variance of a binomial random variable is given by 
$\sigma^2 = npq$ (5.9) 
The standard deviation is the square root of the variance, $\sigma = \sqrt{npq}$. 


EXAMPLE 5.19 The mean for the binomial distribution given in Example 5.18 using formula (5.3) is 
$\mu = \sum x P(x) = 0 \times .0625 + 1 \times .25 + 2 \times .375 + 3 \times .25 + 4 \times .0625 = 2$ 
The mean using the shortcut formula (5.8) is 
$\mu = np = 4 \times .5 = 2$ 
The variance for the binomial distribution given in Example 5.18 using the alternative formula (5.5) is 
$\sigma^2 = \sum x^2 P(x) - \mu^2 = 0 \times .0625 + 1 \times .25 + 4 \times .375 + 9 \times .25 + 16 \times .0625 - 4 = 1$ 
The variance using the shortcut formula (5.9) is 
$\sigma^2 = npq = 4 \times (.5) \times (.5) = 1$ 


EXAMPLE 5.20 Chemotherapy provides a 5-year survival rate of 80% for a particular type of cancer. In a 
group of 20 cancer patients receiving chemotherapy for this type of cancer, the mean number surviving after 5 
years is μ = 20 × .8 = 16 and the standard deviation is σ = √(20 × .8 × .2) = 1.8. On the average, 16 patients 
will survive, and typically the number will vary by no more than two from this figure. 
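

That the shortcut formulas (5.8) and (5.9) agree with the general definitions (5.3) and (5.4) can be confirmed numerically. The following Python sketch does so for the setting of Example 5.20; the names used are illustrative assumptions.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Formula (5.7): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 20, 0.8  # 20 patients, 80% five-year survival rate
q = 1 - p

# Definitions (5.3) and (5.4), summing over x = 0, 1, ..., n.
mean_def = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var_def = sum((x - mean_def) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

# Shortcut formulas (5.8) and (5.9).
mean_short = n * p
var_short = n * p * q
sd_short = sqrt(var_short)

print(round(mean_def, 6), round(var_def, 6), round(sd_short, 1))
```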


POISSON RANDOM VARIABLE 


The binomial random variable is applicable when counting the number of occurrences of an 
event called success in a finite number of trials. When the number of trials is large or potentially 
infinite, another random variable called the Poisson random variable may be appropriate. The 
Poisson probability distribution is applied to experiments with random and independent occurrences 
of an event. The occurrences are considered with respect to a time interval, a length interval, a fixed 
area or a particular volume. 


EXAMPLE 5.21 The number of calls to arrive per hour at the reservation desk for Regional Airlines is a 
Poisson random variable. The calls arrive randomly and independently of one another. The random variable X is 
defined to be the number of calls arriving during a time interval equal to one hour and may be any number from 
0 to some very large value. 


EXAMPLE 5.22 The number of defects in a 10-foot coil of wire is a Poisson random variable. The defects 
occur randomly and independently of one another. The random variable X is defined to be the number of defects 
in a 10-foot coil of wire and may be any number between 0 and some very large value. The interval in this 
example is a length interval of 10 feet. 


EXAMPLE 5.23 The number of pinholes in 1-yd² pieces of plastic is a Poisson random variable. The pinholes 
occur randomly and independently of one another. The random variable X is defined to be the number of 
pinholes per square yard piece and can assume any number between 0 and a very large value. The interval in this 
example is an area of 1 yd². 


POISSON PROBABILITY FORMULA 


The probability of x occurrences of some event in an interval where the Poisson assumptions of 
randomness and independence are satisfied is given by formula (5.10), where λ is the mean number 
of occurrences of the event in the interval and the value of e is approximately 2.71828. The value of e 
is found on most calculators and powers of e are easily evaluated. Tables of Poisson probabilities are 
found in many statistical texts. However, with the widespread availability of calculators and 
statistical software, they are being less widely utilized. 


$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$ for x = 0, 1, 2, ... (5.10) 


The mean of a Poisson random variable is given by 
$\mu = \lambda$ (5.11) 
The variance of a Poisson random variable is given by 
$\sigma^2 = \lambda$ (5.12) 


EXAMPLE 5.24 The number of small pinholes in sheets of plastic is of concern to a manufacturer. If the 
number of pinholes is too large, the plastic is unusable. The mean number per square yard is equal to 2.5. The 
1-yd² sheets are unusable if the number of pinholes exceeds 6. The probability of interest is P(X > 6), where X 
represents the number of pinholes in a 1-yd² sheet. This probability is found by finding the probability of the 
complementary event and subtracting it from 1. 


P(X > 6) = 1 − P(X ≤ 6) 


The command cdf of Minitab can be used to find P(X ≤ 6). The following Minitab output illustrates how 
this is accomplished. 


MTB > cdf 6; 
SUBC > poisson mean = 2.5. 


Cumulative Distribution Function 
Poisson with mu = 2.50000 


x P(X <= x) 
6.00 0.9858 


P(X > 6) = 1 − P(X ≤ 6) = 1 − .9858 = .0142 


By multiplying .0142 by 100 we see that 1.42% of the plastic sheets are unusable. 
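

The cumulative Poisson probability reported by Minitab can also be obtained from formula (5.10) by direct summation. The Python sketch below is one way to do this; the function names poisson_pmf and poisson_cdf are assumptions made for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Formula (5.10): probability of exactly x occurrences when the mean is lam."""
    return lam ** x * exp(-lam) / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), the quantity reported by the Minitab cdf command."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

lam = 2.5  # mean number of pinholes per square yard
prob_unusable = 1 - poisson_cdf(6, lam)  # P(X > 6) by the complement rule

print(round(poisson_cdf(6, lam), 4), round(prob_unusable, 4))
```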


EXAMPLE 5.25 The Poisson distribution approximates the binomial distribution closely when n ≥ 20 and 
p ≤ .05. A machine produces items of which 1% are defective. In a sample of 150 items selected from the output of 
this machine, the probability of two or fewer defectives in the sample is P(X ≤ 2) and is found by the command 
cdf 2; when using Minitab. The following Minitab output gives the binomial probability that X ≤ 2. 


MTB > cdf 2; 
SUBC > binomial n = 150 p = .01. 


Cumulative Distribution Function 
Binomial with n = 150 and p = 0.0100000 


X P(X <= x) 
2.00 0.8095 


The mean number of defectives in a sample of 150 is np = 150 × .01 = 1.5. The Poisson probability that X ≤ 2 
where λ = 1.5 is given by the following Minitab output. 


MTB > cdf 2; 
SUBC > poisson mean = 1.5. 


Cumulative Distribution Function 
Poisson with mu = 1.50000 


x P(X <=x) 
2.00 0.8088 


The binomial probability of the event X ≤ 2 is .8095 and the Poisson approximation is .8088. Notice that the 
Poisson approximation is very close to the binomial probability. 
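

The closeness of the Poisson approximation can be reproduced with a short Python sketch that evaluates both cumulative probabilities from their formulas. The variable names are illustrative, and the Poisson mean is taken to be np, as in the example.

```python
from math import comb, exp, factorial

n, p = 150, 0.01
lam = n * p  # Poisson mean matched to the binomial mean np

# P(X <= 2) under the binomial distribution, formula (5.7).
binom_cdf = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(3))

# P(X <= 2) under the Poisson distribution, formula (5.10).
poisson_cdf = sum(lam ** x * exp(-lam) / factorial(x) for x in range(3))

print(round(binom_cdf, 4), round(poisson_cdf, 4))
```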


HYPERGEOMETRIC RANDOM VARIABLE 


The hypergeometric random variable is used in situations where success or failure is possible on 
each trial but where there is not independence from trial to trial. The lack of independence from trial 
to trial distinguishes the hypergeometric distribution from the binomial distribution. The hypergeometric 
random variable applies in situations where there are N items, of which k are classified as 
successes and N − k are classified as failures. A sample of size n ≤ k is selected from the N items and 
X is defined to equal the number of successes in the n items selected. X is a hypergeometric random 
variable which can equal any whole number from 0 to n. 


EXAMPLE 5.26 A sociologist randomly selects 5 individuals from a group consisting of 10 male single 
parents and 15 female single parents. The random variable X is defined to equal the number of male single 
parents in the 5 selected individuals. In this example, N = 25, k = 10, N − k = 15, and n = 5. The number of male 
single parents in the 5 selected is a hypergeometric random variable. If this hypergeometric random variable is 
represented by X, then X may assume any one of the values 0, 1, 2, 3, 4, or 5. 


EXAMPLE 5.27 A box contains 5 defective and 25 acceptable computer monitors. Three of the monitors are 
randomly selected and X is defined to be the number of defective monitors in the three. X is a hypergeometric 
random variable with N = 30, k = 5, N − k = 25, and n = 3. X may assume any one of the values 0, 1, 2, or 3. 


HYPERGEOMETRIC PROBABILITY FORMULA 


When n items are selected from N items of which k are successes and N − k are failures, the 
random variable X, defined to equal the number of successes in the n selected items, is a 
hypergeometric random variable. The probability distribution of X is given by the hypergeometric 
probability formula shown in formula (5.13). 


$P(x) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} = \frac{\frac{k!}{x!\,(k-x)!} \cdot \frac{(N-k)!}{(n-x)!\,(N-k-n+x)!}}{\frac{N!}{n!\,(N-n)!}}$ for x = 0, 1, ..., n (5.13) 


EXAMPLE 5.28 A police department consists of 25 officers of whom 5 are minorities. Three officers are 
randomly selected to meet with the mayor. Let X be the number of minorities in the three selected to meet with 
the mayor. X is a hypergeometric random variable with N = 25, k = 5, N − k = 20, and n = 3. The probability 
distribution of X is derived as follows: 


$P(0) = \frac{\binom{5}{0}\binom{20}{3}}{\binom{25}{3}} = \frac{1 \times 1140}{2300} = .496$ 
$P(1) = \frac{\binom{5}{1}\binom{20}{2}}{\binom{25}{3}} = \frac{5 \times 190}{2300} = .413$ 
$P(2) = \frac{\binom{5}{2}\binom{20}{1}}{\binom{25}{3}} = \frac{10 \times 20}{2300} = .087$ 
$P(3) = \frac{\binom{5}{3}\binom{20}{0}}{\binom{25}{3}} = \frac{10 \times 1}{2300} = .004$ 


The probability distribution of X is given in Table 5.10. It is highly likely that at most one of the three will be a 
minority. The probability that X ≤ 1 is .909. 


Table 5.10 
x       0      1      2      3 
P(x)    .496   .413   .087   .004 
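

The hypergeometric probability formula can be evaluated with the same combination function used for the binomial case. The Python sketch below recomputes the distribution of Example 5.28; the function name hypergeom_pmf is an assumption made for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """Probability of x successes in a sample of n drawn from N items,
    k of which are successes (the hypergeometric probability formula)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Example 5.28: N = 25 officers, k = 5 minorities, n = 3 selected.
dist = [hypergeom_pmf(x, 25, 5, 3) for x in range(4)]
at_most_one = dist[0] + dist[1]  # P(X <= 1)

print([round(prob, 3) for prob in dist], round(at_most_one, 3))
```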


EXAMPLE 5.29 The binomial distribution approximates the hypergeometric distribution whenever n ≤ .05N. 
A box contains 200 computer chips, of which 7 are defective. The probability of finding one defective in a 
sample of 5 randomly selected chips is given by the following hypergeometric probability computation. 


$P(1) = \frac{\binom{7}{1}\binom{193}{4}}{\binom{200}{5}} = \frac{7 \times 56,031,760}{2,535,650,040} = .155$ 


Since n = 5 ≤ .05 × 200 = 10, the probability may be approximated by using the binomial distribution. The five 
selections of the computer chips may be viewed as n = 5 trials. The probability of success, selecting a defective 
chip, is p = 7/200 = .035, and q = .965. The binomial probability of one defective in the five chips is given as 
follows: 


$P(1) = \frac{5!}{1!\,4!}(.035)^1(.965)^4 = .152$ 


The approximation is very good when n ≤ .05N, as shown in this example. 
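

The comparison between the exact hypergeometric probability and its binomial approximation can be sketched in Python as follows. The variable names are illustrative assumptions; the binomial success probability is taken to be k/N, as in the example.

```python
from math import comb

N, k, n = 200, 7, 5  # chips in the box, defective chips, sample size

# Exact hypergeometric probability of one defective in the sample.
hyper = comb(k, 1) * comb(N - k, n - 1) / comb(N, n)

# Binomial approximation with p = k/N, treating the selections as independent trials.
p = k / N
binom = comb(n, 1) * p * (1 - p) ** (n - 1)

print(round(hyper, 3), round(binom, 3))
```

The two values differ only in the third decimal place, which is the sense in which the approximation is good when the sample is a small fraction of the population.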


Solved Problems 


RANDOM VARIABLE 


5.1 Let X represent the number of boys in families having three children. List all possible birth 
order permutations for families having three children and give the value of X for each 
outcome. 


Ans. Table 5.11 gives the outcomes and the value X assigns to each outcome. 


Table 5.11 
Outcome    Value of X 
BBB        3 
BBG        2 
BGB        2 
GBB        2 
BGG        1 
GBG        1 
GGB        1 
GGG        0 


5.2 For the experiment of rolling three dice, X is defined to be the sum of the three dice. What are 
the unique values assumed by X? 


Ans. The values for X range from 3, corresponding to the outcome (1, 1, 1), to 18, corresponding to the 
outcome (6, 6, 6). The unique values are 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, and 18. 


DISCRETE RANDOM VARIABLE 


5.3 A telemarketing company administers an aptitude test consisting of 25 problems to potential 
employees. A variable of interest to the company is X, the number of problems worked 
correctly. How many different values are possible for X? 


Ans. 26, since an individual can get from 0 to 25 correct. 


5.4 A random variable assigns a single number to each outcome of an experiment. Is it true that 
each value of a random variable corresponds to a single outcome? 


Ans. No. Consider the outcomes and values of X given in Table 5.11. The value X = 1 corresponds to 
the three outcomes BGG, GBG, and GGB, for example. 


CONTINUOUS RANDOM VARIABLE 


5.5 Identify the continuous random variables in parts (a) through (e). 
(a) The time that an individual is logged onto the internet during a given week 
(b) The number of domestic violence calls responded to per day by the Chicago police 
department 
(c) The mortality rate for women who used estrogen therapy for at least a year starting in 1969 
(d) The daily room rate for luxury/upscale hotels in the U.S. 
(e) The number of executions per state since 1975 


Ans. Parts (a) and (d) are continuous random variables. Time and money are almost always considered 
continuous even though in practice they are probably discrete. 


5.6 Is it possible to give a probability value to each individual value of a continuous random 
variable? 


Ans. No. It is not possible to give an individual probability to each value of a continuous variable since 
the variable may assume an uncountably infinite number of different values. Instead, probabilities 
are assigned to an interval of values for a continuous random variable. 


PROBABILITY DISTRIBUTION 


5.7 According to the registrar’s office at the University of Nebraska at Omaha (UNO), during the 
current semester, 9% of the students are registered for 3 credit hours, 13% are registered for 6 
credit hours, 16% are registered for 9 credit hours, 21% are registered for 12 credit hours, 26% 
are registered for 15 credit hours, 13% are registered for 18 credit hours, and 2% are registered 
for 21 credit hours. If the random variable X represents the number of credit hours per student 
at UNO, give the probability distribution for X. 


Ans. The probability distribution for X is given in Table 5.12. 


Table 5.12 
x       3     6     9     12    15    18    21 
P(x)    .09   .13   .16   .21   .26   .13   .02 


5.8 An experiment consists of rolling a die and flipping a coin. The coin has the number 1 stamped 
on one side and the number 2 stamped on the other side. The random variable Y is defined to 
equal the sum of the number showing on the coin plus the number showing on the die after the 
experiment is conducted. Give the probability distribution for Y. 


Ans. Table 5.13 gives the outcomes for the experiment as well as the values the random variable Y 
assigns to the outcomes. From Table 5.13, the probability distribution given in Table 5.14 is 
determined. 


Table 5.13 
Outcome                Value of Y 
(coin = 1, die = 1)    2 
(coin = 1, die = 2)    3 
(coin = 1, die = 3)    4 
(coin = 1, die = 4)    5 
(coin = 1, die = 5)    6 
(coin = 1, die = 6)    7 
(coin = 2, die = 1)    3 
(coin = 2, die = 2)    4 
(coin = 2, die = 3)    5 
(coin = 2, die = 4)    6 
(coin = 2, die = 5)    7 
(coin = 2, die = 6)    8 


Y = 2 corresponds to one of twelve equally likely outcomes. Therefore P(2) equals 1/12 = .083. Y = 
3 corresponds to two of twelve equally likely outcomes. Therefore P(3) equals 2/12 = .167. The 
other probabilities are found in a similar fashion. 


Table 5.14 
y       2      3      4      5      6      7      8 
P(y)    .083   .167   .167   .167   .167   .167   .083 
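

The enumeration used in this answer can be automated. The Python sketch below lists the twelve equally likely outcomes and tallies the values of Y; the variable names are illustrative assumptions, and exact fractions are kept so that the probabilities can be compared with 1/12 and 2/12 directly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# 12 equally likely outcomes: the coin shows 1 or 2, the die shows 1 through 6.
outcomes = list(product([1, 2], range(1, 7)))

# Tally how many outcomes give each value of Y = coin + die.
counts = Counter(coin + die for coin, die in outcomes)
dist = {y: Fraction(c, len(outcomes)) for y, c in sorted(counts.items())}

for y, p in dist.items():
    print(y, p, round(float(p), 3))
```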


MEAN OF A DISCRETE RANDOM VARIABLE 


5.9 A roulette wheel has 18 red, 18 black, and 2 green slots. You bet $10 on red. If red comes up, 
you get your $10 back plus $10 more. If red does not come up, you lose your $10. Let random 
variable P represent your profit when playing this roulette wheel. Find the mean value of P. 
Ans. The distribution of P is given in Table 5.15. The probability of a $10 loss, i.e., P = -10, is 20/38 = 
.526, and the probability of a $10 gain is 18/38 = .474. 


Table 5.15 
p       -10     10 
P(p)    .526    .474 


The mean profit is μ = Σ xP(x) = -10 × .526 + 10 × .474 = -.52. Your average loss is 52 cents per 
play of the roulette wheel. 


5.10 The distribution of the number of children per household for households receiving Aid to 
Dependent Children (ADC) in a large eastern city is as follows: Five percent of the ADC 
households have one child, 35% have 2 children, 30% have 3 children, 20% have 4 children, 
and 10% have 5 children. Find the mean number of children per ADC household in this city. 


Ans. The mean is μ = Σ xP(x) = 1 × .05 + 2 × .35 + 3 × .30 + 4 × .20 + 5 × .10 = 2.95. The mean is 
about 3 per household. 


STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE 


5.11 Find the standard deviation of the profit when playing the color red on the roulette wheel 
described in problem 5.9. 


Ans. The variance is given by Σ x²P(x) − μ² = 100 × .526 + 100 × .474 − .2704 = 99.7296, and the 
standard deviation is √99.7296 = $9.99. 


5.12 Find the standard deviation of the number of children per ADC household for the distribution 
given in problem 5.10. 


Ans. The variance is given by Σ x²P(x) - μ² = 1 × .05 + 4 × .35 + 9 × .30 + 16 × .20 + 25 × .10 - 2.95² =
1.1475 and the standard deviation is √1.1475 = 1.07.
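
As a rough check of the shortcut formula for the variance, the following Python sketch recomputes problem 5.12. The function name `discrete_variance` is an assumption made for this illustration.

```python
import math


def discrete_variance(dist):
    """Variance by the shortcut formula: sum of x^2 * P(x) minus the squared mean."""
    mean = sum(x * p for x, p in dist.items())
    return sum(x * x * p for x, p in dist.items()) - mean ** 2


# Distribution of children per ADC household from problem 5.10
children = {1: 0.05, 2: 0.35, 3: 0.30, 4: 0.20, 5: 0.10}

variance = discrete_variance(children)
print(round(variance, 4), round(math.sqrt(variance), 2))
```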


BINOMIAL RANDOM VARIABLE 


5.13 Ninety percent of the residents of Stanford, California, twenty-five years of age or older have 
at least a bachelor’s degree. Three hundred residents of Stanford twenty-five years or older are 
selected for a poll concerning higher education. If X represents the number in the 300 who 
have at least a bachelor’s degree, give the conditions necessary for X to be a binomial random 
variable and identify n, p, and q. 


Ans. The main condition we need to be concerned about is that the 300 residents be selected 
independently. There are 300 identical trials and each trial has only two possible outcomes. We 
may identify success with having at least a bachelor’s degree and failure with having less than a 
bachelor’s degree. If the respondents are chosen randomly and independently, then the 
probabilities of success and failure should remain constant. Pollsters use standard techniques to 
ensure independence of individuals selected. The values of n, p, and q are 300, .90, and .10 
respectively. 


5.14 A box contains 20 items, of which 25% are defective. Three items are randomly selected and 
X, the number of defectives in the three selected items, is determined. Explain why X is not a 
binomial random variable with n = 3 and p =.25. 


Ans. The probabilities p and q do not remain constant from trial to trial. Suppose we are interested in
the probability that X = 3; that is, we are interested in the probability of getting three consecutive
defectives. The probability that the first one is defective is 5/20 = .25. The probability that the second is
defective after selecting a defective on the first selection is 4/19 = .21, and the probability that the
third is defective after selecting defectives on the first two selections is 3/18 = .17. The binomial
model assumes that the probability p remains constant at p = .25. In this experiment, p does not
remain constant, but changes from .25 to .21 to .17.
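
The changing probabilities can be seen in a small Python sketch. It assumes the box holds 5 defectives among the 20 items (25%) and uses exact fractions; the variable names are ours.

```python
from fractions import Fraction

defective, total = 5, 20  # 25% of the 20 items are defective

# Probability that each successive draw is defective, given that every
# earlier draw was defective (sampling without replacement).
steps = [Fraction(defective - i, total - i) for i in range(3)]
print([round(float(s), 2) for s in steps])

# Exact probability of three defectives in three draws
p_three = steps[0] * steps[1] * steps[2]
print(p_three, float(p_three))

# What a binomial model with constant p = .25 would give instead
print(0.25 ** 3)
```

The two final values differ, which is another way of seeing that the binomial model does not fit sampling without replacement from a small box.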


BINOMIAL PROBABILITY FORMULA 


5.15 Approximately 12% of the U.S. population is composed of African-Americans. Assuming that 
the same percentage is true for telephone ownership, what is the probability that, when 25
phone numbers are selected at random for a small survey, 5 of the numbers belong to an
African-American family?


Ans. Let X represent the number of phone numbers in the 25 belonging to African-Americans. Then, X
has a binomial distribution with n = 25 and p = .12. The probability P(X = 5) is given as follows:

$$P(X = 5) = \frac{25!}{5!\,20!}(.12)^{5}(.88)^{20} = .1025$$


5.16 It is estimated that 42% of women ages 45 to 54 are overweight. If 20 females between 45 and
54 are randomly selected, what is the probability that one-half of them are overweight? 


Ans. Let X represent the number of women in the 20 who are overweight. Then, X has a binomial 
distribution with n = 20 and p = .42. The probability P(X = 10) is given as follows: 


$$P(X = 10) = \frac{20!}{10!\,10!}(.42)^{10}(.58)^{10} = .1359$$
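
The binomial probability formula used in problems 5.15 and 5.16 can be evaluated directly, as in this Python sketch. The function name `binomial_pmf` is our own; `math.comb` supplies the number of combinations n!/(x!(n - x)!).

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


print(round(binomial_pmf(5, 25, 0.12), 4))   # problem 5.15
print(round(binomial_pmf(10, 20, 0.42), 4))  # problem 5.16
```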


TABLES OF THE BINOMIAL DISTRIBUTION 


5.17 Sixty percent of teenagers who drink alcohol do so because of peer pressure. Use the table of 
binomial probabilities to find the probability that in a sample of 15 teenagers who drink, 5 or 
fewer do so because of peer pressure. 


Ans. Table 5.16 shows the portion of the table needed to compute P(X ≤ 5).


P(X ≤ 5) = .0000 + .0000 + .0003 + .0016 + .0074 + .0245 = .0338


Table 5.16 


5.18 A domestic homicide is one in which the victim and the killer are relatives or involved in a 
relationship. Suppose 40% of all murders are domestic homicides. A criminal justice study 
randomly selects 10 murder cases for investigation. Use the table of binomial probabilities to 
find the probability that between one and four inclusive of the murder cases will be domestic 
homicide cases. 


Ans. Table 5.17 shows the portion of the table needed to compute P(1 ≤ X ≤ 4).

P(1 ≤ X ≤ 4) = .0403 + .1209 + .2150 + .2508 = .6270


Table 5.17 
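
When a binomial table is not at hand, the individual table entries can be recomputed and summed. The following Python sketch does this for problem 5.18; the helper names `binomial_pmf` and `binomial_between` are assumptions made for this illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_between(lo, hi, n, p):
    """P(lo <= X <= hi), found by summing the individual binomial probabilities."""
    return sum(binomial_pmf(x, n, p) for x in range(lo, hi + 1))


# Problem 5.18: n = 10 murder cases, p = .40
terms = [round(binomial_pmf(x, 10, 0.40), 4) for x in range(1, 5)]
print(terms)                                       # the four table entries
print(round(binomial_between(1, 4, 10, 0.40), 4))  # their sum
```

The same function applies to problem 5.17 by calling `binomial_between(0, 5, 15, 0.60)`.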


MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE


5.19 Seventy-five percent of employed women say their income is essential to support their family. 
Let X be the number in a sample of 200 employed women who will say their income is 
essential to support their family. What is the mean and standard deviation of X? 


Ans. X is a binomial random variable with n = 200 and p = .75. The mean is μ = np = 200 × .75 = 150,
and the standard deviation is σ = √(npq) = √37.5 = 6.12.


5.20 A binomial distribution has a mean equal to 8 and a standard deviation equal to 2. Find the
values for n and p.


Ans. The following equations must hold: 8 = np and 4 = npq. Substituting 8 for np in the second
equation gives 4 = 8q, which gives q = .5. Since p + q = 1, p = 1 - .5 = .5. Substituting .5 for p in
the first equation gives n(.5) = 8, and it follows that n = 16.
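
Both directions of the calculation in problems 5.19 and 5.20 can be sketched in Python. The function names are our own, and `binomial_params` simply follows the substitution steps of the solution above.

```python
import math


def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(npq) of a binomial random variable."""
    return n * p, math.sqrt(n * p * (1 - p))


def binomial_params(mean, sd):
    """Recover (n, p) from the mean and standard deviation, as in problem 5.20."""
    q = sd ** 2 / mean  # npq divided by np leaves q
    p = 1 - q
    return round(mean / p), p


mean, sd = binomial_mean_sd(200, 0.75)  # problem 5.19
print(mean, round(sd, 2))
print(binomial_params(8, 2))            # problem 5.20
```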


POISSON RANDOM VARIABLE 


5.21 Consider the number of customers arriving at the Grover Street branch of Industrial and
Federal Savings and Loan during a one-hour interval. What assumptions concerning the
arrivals of customers are necessary in order that X, the number of customers arriving in a
one-hour interval, be a Poisson random variable?


Ans. The assumptions necessary are that the arrivals be random and independent of one another. 


5.22 Why are the arrivals of patients at a physician's office and the arrivals of commercial airplanes
at an airport not Poisson random variables? 


Ans. Because of appointment times to see a physician, the arrivals are not random. Because of 
scheduled arrivals of commercial airplanes, the arrivals are not random. 


POISSON PROBABILITY FORMULA 


5.23 The mean number of patients arriving at the emergency room of University Hospital on 
Saturday nights between 10:00 and 12:00 is 6.5. Assuming that the patients arrive randomly 
and independently, what is the probability that on a given Saturday night, 5 or fewer patients 
arrive at the emergency room between 10:00 and 12:00?


Ans. Let X represent the number of patients to arrive at the emergency room of University Hospital
between 10:00 and 12:00. The probability formula for X is

$$P(x) = \frac{6.5^{x} e^{-6.5}}{x!}$$

The probability of the event that X ≤ 5 is P(0) + P(1) + P(2) + P(3) + P(4) + P(5). Each of these
probabilities contains the common term e^(-6.5), which may be factored out to give the following as
the probability of the event of interest.

$$P(X \le 5) = e^{-6.5}\left(\frac{6.5^{0}}{0!} + \frac{6.5^{1}}{1!} + \frac{6.5^{2}}{2!} + \frac{6.5^{3}}{3!} + \frac{6.5^{4}}{4!} + \frac{6.5^{5}}{5!}\right)$$

P(X ≤ 5) = .001503(1 + 6.5 + 21.125 + 45.7708 + 74.3776 + 96.6909) = .369
The solution using Minitab is as follows.


MTB > cdf 5;
SUBC > Poisson 6.5.

Cumulative Distribution Function
Poisson with mu = 6.50000

       x   P(X <= x)
    5.00      0.3690


5.24 When working problems involving the Poisson random variable, it is important to remember
that the interval for the mean number of occurrences and the interval for X must be equal. If 
they are not, the mean should be redefined to make them equal. This problem illustrates this 
important point. 


The mean number of patients arriving at the emergency room of University Hospital on 
Saturday nights between 10:00 and 12:00 is 6.5. Assuming that the patients arrive randomly
and independently, what is the probability that on a given Saturday night, 2 or fewer patients 
arrive at the emergency room between 11:00 and 12:00? 


Ans. Let X be the number of patients to arrive at the emergency room of University Hospital on
Saturday nights between 11:00 and 12:00. The mean for X is 6.5/2 = 3.25, and the event of interest
is X ≤ 2.

$$P(X \le 2) = e^{-3.25}\left(\frac{3.25^{0}}{0!} + \frac{3.25^{1}}{1!} + \frac{3.25^{2}}{2!}\right) = .370$$
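
The Poisson sums in problems 5.23 and 5.24 can also be checked with a short Python sketch, given here as an alternative to the Minitab session. The function name `poisson_cdf` is our own. Note that the mean is rescaled to match the interval before it is used.

```python
from math import exp, factorial


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson random variable with the given mean."""
    return exp(-mean) * sum(mean ** x / factorial(x) for x in range(k + 1))


# Problem 5.23: mean of 6.5 arrivals in the two-hour interval
print(round(poisson_cdf(5, 6.5), 4))

# Problem 5.24: a one-hour interval, so the mean is halved first
print(round(poisson_cdf(2, 6.5 / 2), 4))
```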


HYPERGEOMETRIC RANDOM VARIABLE 


5.25 A box contains 10 red marbles and 90 blue marbles. Five marbles are selected randomly from
the 100 in the box. Let X be the number of blue marbles in the five selected marbles. Identify
the values for N, k, N - k, and n in the hypergeometric distribution which corresponds to X.


Ans. The total number of marbles is N = 100. The number of blue marbles (successes) is k = 90. The
number of red marbles (failures) is N - k = 10. The sample size is n = 5.


5.26 In problem 5.25, consider the event X = 2. This is the event that five marbles are selected and 2
are blue and 3 are red.

(a) How many ways may 5 marbles be selected from 100 marbles?

(b) How many ways may 2 blue marbles be selected from 90 blue marbles?

(c) How many ways may 3 red marbles be selected from 10 red marbles?

(d) How many ways may 3 red marbles be selected from 10 red marbles and 2 blue marbles be
selected from 90 blue marbles?


Ans. (a) $\binom{100}{5}$ = 75,287,520  (b) $\binom{90}{2}$ = 4,005  (c) $\binom{10}{3}$ = 120
(d) 120 × 4,005 = 480,600
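
The four counts in problem 5.26 can be reproduced with Python's `math.comb`, as in the sketch below. The variable names are ours.

```python
from math import comb

total_ways = comb(100, 5)         # (a) any 5 marbles from 100
blue_ways = comb(90, 2)           # (b) 2 blue marbles from 90
red_ways = comb(10, 3)            # (c) 3 red marbles from 10
both_ways = blue_ways * red_ways  # (d) multiplication rule

print(total_ways, blue_ways, red_ways, both_ways)
```

Dividing `both_ways` by `total_ways` would give the probability of the event X = 2.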


HYPERGEOMETRIC PROBABILITY FORMULA


5.27 Computec, Inc. manufactures personal computers. There are 40 employees at the Omaha plant
and 15 employees at the Lincoln plant. Five employees of Computec, Inc. are randomly
selected to fill out a benefits questionnaire.

(a) What is the probability that none of the five selected are from the Lincoln plant?
(b) What is the probability that all five of the selected employees are from the Lincoln plant?


Ans.

$$\text{(a)}\quad \frac{\binom{15}{0}\binom{40}{5}}{\binom{55}{5}} = \frac{1 \times 658{,}008}{3{,}478{,}761} = .189 \qquad \text{(b)}\quad \frac{\binom{15}{5}\binom{40}{0}}{\binom{55}{5}} = \frac{3{,}003 \times 1}{3{,}478{,}761} = .000863$$


5.28 What is the probability of getting 5 face cards when 5 cards are selected from a deck of 52? 


Ans. A deck of cards consists of 12 face cards and 40 nonface cards. Let X represent the number of
face cards in the 5 cards selected. The event in which we are interested is X = 5. The probability is

$$P(X = 5) = \frac{\binom{12}{5}\binom{40}{0}}{\binom{52}{5}} = \frac{792}{2{,}598{,}960} = .000305$$
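
The hypergeometric probabilities in problems 5.27 and 5.28 follow one pattern, which the Python sketch below makes explicit. The function name `hypergeom_pmf` and its argument order are assumptions made for this illustration; N, k, and n have the meanings used in problem 5.25.

```python
from math import comb


def hypergeom_pmf(x, N, k, n):
    """P(X = x): x successes in a sample of n drawn without replacement
    from N items, of which k are successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


# Problem 5.27: N = 55 employees, k = 15 at the Lincoln plant, n = 5 selected
print(round(hypergeom_pmf(0, 55, 15, 5), 3))  # (a) none from Lincoln
print(round(hypergeom_pmf(5, 55, 15, 5), 6))  # (b) all five from Lincoln

# Problem 5.28: N = 52 cards, k = 12 face cards, n = 5 cards
print(round(hypergeom_pmf(5, 52, 12, 5), 6))
```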


Supplementary Problems 


RANDOM VARIABLE 


5.29 A taste test is conducted involving 35 individuals. Random variable X is the number in the 35 who prefer
a locally produced nonalcoholic beer to a national brand. What are the possible values for X? 


Ans. The whole numbers 0 through 35 


5.30 A psychological experiment was conducted in which the time to traverse a maze was recorded for each of 
five dogs. The times were 4, 6, 8, 9, and 12 minutes. Two of the times were randomly selected and the
difference X = largest of the pair — smallest of the pair was recorded. Give all possible pairs of possible 
selections, and the value of X for each outcome. 


Ans. See Table 5.18.

Table 5.18

Pair selected    Value of X
4, 6             2
4, 8             4
4, 9             5
4, 12            8
6, 8             2
6, 9             3
6, 12            6
8, 9             1
8, 12            4
9, 12            3


DISCRETE RANDOM VARIABLE


5.31 A die is tossed until the face 6 turns up. Let X be the number of tosses needed until the face 6 first turns
up. Give the possible values for the variable X.


Ans. The possible values are the positive integers. That is, the possible values are 1, 2, 3, . . . .


5.32 Identify the discrete random variables in parts (a) through (e).

(a) The number of arrests during a 10-day period during which the police apply a zero-tolerance
strategy

(b) The time workers have spent with their current employer 

(c) The number of nurse practitioners per state 

(d) The career lifetime of major league baseball players 

(e) The number of executions of death row inmates per year in the U.S. 


Ans. (a), (c), and (e) 


CONTINUOUS RANDOM VARIABLE 


5.33 Identify the continuous random variables in the following list.
(a) Weight of individuals in kg

(b) Serum cholesterol level in mg/dl

(c) Length of intravenous therapy in hours

(d) Body mass index in kg/m²

(e) Cardiac output in liters/minute


Ans. All five are continuous.


5.34 What is the primary difference between a discrete random variable and a continuous random variable?


Ans. There are values between the possible values of a discrete random variable which are not possible 
values for the random variable. This is not generally true for a continuous random variable. 


PROBABILITY DISTRIBUTION 


5.35 Suppose Table 5.19 gives the number in thousands of students in grades 9 through 12 for public schools
in the United States. Let X represent the grade level. Give the probability distribution for X.


Table 5.19

Grade                  9       10      11      12
Number (thousands)   3,525   3,475   3,050   2,950


Ans. The distribution is given in Table 5.20.


Table 5.20

x        9      10     11     12
P(x)   .271   .267   .235   .227


5.36 Which of the following are probability distributions? For those which are not, tell why they are not. 


(a)


(b)


(c)


(d) P(x) = .2x², x = 1, 2


Ans. Parts (b) and (d) are probability distributions. Part (a) is not because Σ P(x) = 1.2. Part (c) is not
because P(50) < 0.


MEAN OF A DISCRETE RANDOM VARIABLE 


5.37 The number of personal computers per household in the United States has the probability distribution 
shown in Table 5.21. Find the mean number of personal computers per household. 


Table 5.21

x       0     1     2
P(x)   .30   .55   .15


Ans. μ = 0.85


5.38 Based on a large survey, the distribution in Table 5.22 was found for the number of pounds individuals
desired to lose. Find the mean number of pounds they desired to lose. 


Table 5.22 
Pounds    0    5    15    25    50
33 2510 


Ans. μ = 13.95


STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE 


5.39 Find the standard deviation of the number of personal computers per household for the distribution given
in Table 5.21.


Ans. σ = 0.65


5.40 Find the standard deviation of the number of pounds desired to be lost.


Ans. σ = 15.55


BINOMIAL RANDOM VARIABLE 


5.41 Thirty percent of the trees in a national forest are infested with a parasite. Fifty trees are randomly 
selected from this forest and X is defined to equal the number of trees in the 50 sampled that are infested 
with the parasite. The infestation is uniformly spread throughout the forest. Identify the values for n, p, 
and q. 


Ans. n=50, p= .30, and q = .70 


5.42 Suppose in problem 5.41 that we define Y to be the number of trees in the 50 sampled that are not
infested with the parasite. Then Y is a binomial random variable. 
(a) What are the values of n, p, and q for Y? 
(b) The event X = 20 is equivalent to the event that Y = a. Find the value for a. 


Ans. (a) n=50, p=.70, and q = .30 
(b) a= 30 


BINOMIAL PROBABILITY FORMULA 


5.43 The Dallas-Fort Worth Airport claims that 85% of their flights are on time. If the claim is correct, what is
the probability that, in a sample of 20 flights at the Dallas-Fort Worth Airport, 15 or more of the
sample flights are on time?


Ans. .9327 


5.44 A psychological study involving the troops in the Bosnia peacekeeping force was conducted. If 12 
percent of the 21,496 troops are females, what is the probability that, in a sample of 50 randomly selected
individuals, five or fewer are female?


Ans.  .4353 


TABLES OF THE BINOMIAL DISTRIBUTION 


5.45 There are approximately 3,000 inmates on death row. Forty percent of the death row inmates are African- 
American. Twenty of the death row inmates are randomly selected for a sociological study. Use the table 
of binomial probabilities to find the probability that most of the selected inmates are African-American. 


Ans. .1275 


5.46 Ten percent of the Rentwheels car-rental fleet are equipped with cellular phones. If five of the cars are 
randomly selected, what is the probability that none are equipped with a cellular phone?


Ans. .5905 


MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE 


5.47 It is conjectured that 60% of the deaths from melanoma can be prevented by a skin self-exam. If this
conjecture is correct, how many of the 7,000 deaths due to this skin cancer would be prevented per year
on the average? What is the standard deviation associated with the number of deaths prevented?


Ans. Mean = 4,200; standard deviation = 41


5.48 Fifteen percent of the machinery and equipment at businesses is more than 10 years old. In a randomly
selected sample of 35 businesses, how many would you expect to have machinery or equipment that is
more than 10 years old? What standard deviation is associated with this expected number?


Ans. Mean = 5.25; standard deviation = 2.11


POISSON RANDOM VARIABLE


5.49 Give four examples of a Poisson random variable.


Ans. 1. The number of telephone calls received per hour by an office
2. The number of keyboard errors per page made by an individual using a word processor
3. The number of bacteria in a given culture
4. The number of imperfections per yard in a roll of fabric


5.50 What are the potential values of a Poisson random variable?


Ans. 0, 1, 2, . . .


POISSON PROBABILITY FORMULA 


5.51 Suppose the mean number of earthquakes per year is 13. What is the probability of 20 or more
earthquakes in a given year?


Ans. .0427


5.52 The number of plankton in a liter of lake water has a mean value of 7. What is the probability that the
number of plankton in a given liter is within one standard deviation of the mean? 


Ans. .6575 


HYPERGEOMETRIC RANDOM VARIABLE 


5.53 Since 1977 twenty-four states have executed at least one death row inmate. In a study concerning capital
punishment, ten of the fifty states are randomly selected. Let X represent the number of states in the ten
that have executed at least one death row inmate. Identify success and failure. What are the values of N,
k, N - k, and n?


Ans. Success is that at least one death row inmate has been executed since 1977. Failure is that no
execution has occurred in the state since 1977. N = 50, k = 24, N - k = 26, and n = 10.


5.54 If success and failure are interchanged in problem 5.53, how is X changed and what are the values of N,
k, N - k, and n for X?


Ans. X is the number of states in the ten that have executed no one since 1977. N = 50, k = 26, N - k =
24, and n = 10.


HYPERGEOMETRIC PROBABILITY FORMULA 


5.55 Thirty diabetics have volunteered for a medical study. Ten of the diabetics have high blood pressure. Five
are selected for a preliminary screening for the study. What is the probability that none of the five
selected have high blood pressure if the selection is done randomly?


Ans. .1088


5.56 A box of manufactured products contains 20 items. Three of the items are defective. Let X represent the
number of defectives in three randomly selected from the box. Give the probability distribution for X and
find the mean value for X using the probability distribution. Show that the mean is also given by μ = nk/N.


Ans.

x         0          1          2          3
P(x)   .596491    .357895    .044737    .000877

μ = .45


Chapter 6 


Continuous Random Variables and 
Their Probability Distributions 


UNIFORM PROBABILITY DISTRIBUTION 


A continuous random variable is a random variable capable of assuming all the values in an 
interval or several intervals of real numbers. Because of the uncountable number of possible values, 
it is not possible to list all the values and their probabilities for a continuous random variable in a 
table as is true with a discrete random variable. The probability distribution for a continuous random 
variable is represented as the area under a curve called the probability density function, abbreviated 
pdf. A pdf is characterized by the following two basic properties: The graph of the pdf is never below 
the x axis and the total area under the pdf always equals 1.

The probability density function shown in Fig. 6-1 is a uniform probability distribution. This
pdf represents the distribution of flight times between Omaha, Nebraska, and Memphis, Tennessee.
The flight time is represented by the letter X. The graph shows that the flight times range from 90 to
100 minutes. The distance from the x axis to the graph remains constant at 0.10 and since the area of
a rectangle is given by the length times the width, the area under the pdf is 10 × 0.10 = 1. Note that
this pdf has the two basic properties given above. The graph of the pdf is never below the x axis and
the total area under the pdf is equal to 1.


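[Fig. 6-1: Uniform pdf f(x) at constant height 0.10 over the interval from 90 to 100]

The representation of Fig. 6-1 by an equation is given as follows.

$$f(x) = \begin{cases} .10 & 90 < x < 100 \\ 0 & \text{elsewhere} \end{cases}$$

In general, if a random variable X is uniformly distributed over the interval from a to b, then the
pdf is given by formula (6.1).

$$f(x) = \begin{cases} \dfrac{1}{b-a} & a < x < b \\ 0 & \text{elsewhere} \end{cases} \qquad (6.1)$$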


EXAMPLE 6.1 The probability that a flight takes between 92 and 97 minutes is represented as P(92 < X < 97)
and is equal to the shaded area shown in Fig. 6-2. The rectangular shaded area has a width equal to 5 and a
length equal to .1 and the area is equal to 5 × .1 = .5. That is, 50 percent of the flights between Omaha and
Memphis will take between 92 and 97 minutes.

[Fig. 6-2: Uniform pdf f(x) over 90 to 100 with the area between 92 and 97 shaded]
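
The rectangle-area calculation of Example 6.1 can be written as a small Python function. This is only a sketch: the name `uniform_prob` is our own, and the clipping to the interval from a to b is added so that limits outside the range of X are handled sensibly.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo < X < hi) for X uniform on (a, b): overlap width times the height 1/(b - a)."""
    lo, hi = max(lo, a), min(hi, b)  # keep only the part inside (a, b)
    return max(hi - lo, 0) / (b - a)


print(uniform_prob(92, 97, 90, 100))  # Example 6.1
```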


MEAN AND STANDARD DEVIATION FOR THE UNIFORM 
PROBABILITY DISTRIBUTION 


The mean value for a random variable having a uniform probability distribution over the interval 
from a to b is given by

$$\mu = \frac{a+b}{2} \qquad (6.2)$$

The variance for a uniform random variable is given by

$$\sigma^{2} = \frac{(b-a)^{2}}{12} \qquad (6.3)$$


EXAMPLE 6.2 The weights of 10-pound bags of potatoes packaged by Idaho Farms Inc. are uniformly 
distributed between 9.75 pounds and 10.75 pounds. The distribution of weights for these bags is shown in Fig. 
6-3. 

[Fig. 6-3: Uniform pdf f(x) at constant height 1.00 over the interval from 9.75 to 10.75]

Using formula (6.2), we see that the mean weight per bag is μ = (a + b)/2 = (9.75 + 10.75)/2 = 10.25 pounds and using
formula (6.3), the standard deviation is σ = √((b - a)²/12) = √(1/12) = 0.29. If X represents the weight per bag, then
P(X > 10.0) corresponds to the proportion of bags that weigh more than 10 pounds. This probability is shown in
Fig. 6-4. The shaded rectangle has dimensions 1 and 0.75 and the area is 1 × 0.75 = 0.75. Seventy-five percent
of the bags weigh more than 10 pounds. To help ensure consumer satisfaction, Idaho Farms instructs their
employees to try and never underfill if possible. This is reflected in the average weight of 10.25 pounds per bag.
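
The three quantities in Example 6.2 can be recomputed with a few lines of Python. The variable names are ours; the formulas are (6.2) and (6.3).

```python
import math

a, b = 9.75, 10.75  # weights are uniform between these limits (pounds)

mean = (a + b) / 2                 # formula (6.2)
sd = math.sqrt((b - a) ** 2 / 12)  # square root of formula (6.3)
p_over_10 = (b - 10.0) / (b - a)   # area of the rectangle from 10.0 to b

print(mean, round(sd, 2), p_over_10)
```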


[Fig. 6-4: Uniform pdf f(x) over 9.75 to 10.75 with the area to the right of 10.00 shaded]

The probability associated with a single value for a continuous random variable is always equal
to zero, since there is no area associated with a single point. That is, the probability that X = a is
given by

P(X = a) = 0        (6.4)


NORMAL PROBABILITY DISTRIBUTION


The most important and widely used of all continuous distributions is the normal probability
distribution. Figure 6-5 shows the pdf for a normal distribution having mean μ and standard
deviation σ.

[Fig. 6-5: Normal pdf f(x), a bell-shaped curve centered at μ]


Table 6.1 gives some of the main properties of the normal curve shown in Fig. 6-5. Figure 6-6 
shows two different normal curves with the same mean but different standard deviations. The larger 
the standard deviation, the more dispersed are the values about the mean.


Table 6.1
Properties of the Normal Probability Distribution

1. The total area under the normal curve is equal to one.
2. The curve is symmetric about μ and the area under the curve on each side of the mean equals 0.5.
3. The tails of the curve extend indefinitely.
4. Each pair of values for μ and σ determines a different normal curve.
5. The highest point on a normal curve occurs at the mean.
6. The mean, median, and mode are all equal for a normal curve.
7. The mean locates the center of the curve and can be any real number: negative, positive, or zero.
8. The standard deviation is positive, and determines the shape of the normal curve. The larger the
   standard deviation, the wider and flatter the curve.
9. 68.26% of the area under the curve is within 1 standard deviation of the mean, 95.44% of the area
   is within 2 standard deviations, and 99.72% of the area is within 3 standard deviations of the mean.


[Fig. 6-6: Two normal curves with the same mean but different standard deviations]


Figure 6-7 shows two normal curves having equal standard deviations but different means. The 
normal curve with mean equal to 66 represents the distribution of adult female heights and the 
normal curve with mean equal to 70 represents the distribution of adult male heights. 


[Fig. 6-7: Two normal curves with equal standard deviations, one centered at 66 and one centered at 70]


The equation of the pdf of a normal curve having mean μ and standard deviation σ is given in
formula (6.5).

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \qquad (6.5)$$

The two constants in the formula, e and π, are important constants in mathematics. The constant e is
equal to 2.718281828 . . . and the constant π is equal to 3.141592654 . . . . The equation of the pdf
for the normal curve is not used a great deal in practice and is given here for the sake of 
completeness. 


STANDARD NORMAL DISTRIBUTION 


The standard normal distribution is the normal distribution having mean equal to 0 and standard 
deviation equal to 1. The letter Z is used to represent the standard normal random variable. The
standard normal curve is shown in Fig. 6-8. The curve is centered at the mean, 0, and the z-axis is
labeled in standard deviations above and below the mean.

[Fig. 6-8: Standard normal curve f(z), with the z-axis labeled from -3 to 3]


Appendix 2 contains the standard normal distribution table. This table gives areas under the 
standard normal curve for the variable Z ranging from 0 to a positive number z. Some examples will 
now be given to illustrate how to use this table. 


EXAMPLE 6.3 Table 6.2 illustrates how to use the standard normal distribution table to find the area under 
the standard normal curve between z = 0 and z = 1.65. Figure 6-9 shows the corresponding area as the shaded 
region under the curve. The value 1.65 may be written as 1.6 + .05, and by locating 1.6 under the column 
labeled z and then moving to the right of 1.6 until you come under the .05 column you find the area .4505. This 
is the area shown in Fig. 6-9. We express this area as P(0 < Z < 1.65) = .4505.


Table 6.2

 z      .00     .01    . . .    .05    . . .
0.1    .0398   .0438
0.2    .0793   .0832
 .
 .
1.6    .4452   .4463   . . .   .4505
 .
 .
3.0    .4987   .4987


[Fig. 6-9: Standard normal curve with the area between 0 and 1.65 shaded]


EXAMPLE 6.4 The area under the standard normal curve between z = -1.65 and z = 0 is represented as
P(-1.65 < Z < 0) and is shown in Fig. 6-10. By symmetry, the following probabilities are equal.


P(-1.65 < Z < 0) = P(0 < Z < 1.65)


From Example 6.3, we know that P(0 < Z < 1.65) = .4505 and therefore, P(-1.65 < Z < 0) = .4505.

[Fig. 6-10: Standard normal curve with the area between -1.65 and 0 shaded]


EXAMPLE 6.5 The area under the standard normal curve between z = -1.65 and z = 1.65 is represented by
P(-1.65 < Z < 1.65) and is shown in Fig. 6-11. The probability P(-1.65 < Z < 1.65) is expressible as


P(-1.65 < Z < 1.65) = P(-1.65 < Z < 0) + P(0 < Z < 1.65)


The probabilities on the right side of the above equation are given in Examples 6.3 and 6.4, and their sum is
equal to 0.9010. Therefore, P(-1.65 < Z < 1.65) = .9010.

[Fig. 6-11: Standard normal curve with the area between -1.65 and 1.65 shaded]


EXAMPLE 6.6 The probability of the event Z < 1.96 is represented by P(Z < 1.96) and is shown in Fig. 6-12.

[Fig. 6-12: Standard normal curve with the area to the left of 1.96 shaded]

The area shown in Fig. 6-12 is partitioned into two parts as shown in Fig. 6-13. The darker of the two areas is
equal to P(Z < 0) = .5, since it is one-half of the total area. The lighter of the two areas is found in the standard
normal distribution table to be .4750. The sum of the two areas is .5 + .4750 = .9750. Summarizing,


P(Z < 1.96) = P(Z < 0) + P(0 < Z < 1.96) = .5 + .4750 = .9750

[Fig. 6-13: The area to the left of 1.96 partitioned at z = 0 into two parts]


The probability in Example 6.6 can also be found by using CDF of Minitab as follows:


MTB > cdf 1.96;
SUBC > normal 0 1.

Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1.00000

       x   P(X <= x)
  1.9600      0.9750

[Fig. 6-14: Standard normal curve with the area to the right of 1.96 shaded]

In order to find the probability P(Z > 1.96), shown in Fig. 6-14, we use the complement of the event Z > 1.96
and the result for P(Z < 1.96) given above as follows.


P(Z > 1.96) = 1 - P(Z < 1.96) = 1 - .9750 = .0250
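
Areas under the standard normal curve can also be computed without a table. The Python sketch below is one way to do so; it relies on the standard identity relating the normal cumulative distribution function to the error function `math.erf`, and the name `std_normal_cdf` is our own.

```python
from math import erf, sqrt


def std_normal_cdf(z):
    """P(Z <= z) for the standard normal random variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))


print(round(std_normal_cdf(1.65) - 0.5, 4))  # Example 6.3: P(0 < Z < 1.65)
print(round(std_normal_cdf(1.96), 4))        # Example 6.6: P(Z < 1.96)
print(round(1 - std_normal_cdf(1.96), 4))    # P(Z > 1.96)
```

Because the table entries are rounded to four places, sums of table values can differ from a direct computation in the last digit.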


STANDARDIZING A NORMAL DISTRIBUTION 


In order to find areas under a normal distribution having mean μ and standard deviation σ, the
normal distribution must be standardized. A normal random variable X having mean μ and standard
deviation σ is converted or transformed to a standard normal random variable by the formula given
in (6.6).

$$Z = \frac{X - \mu}{\sigma} \qquad (6.6)$$


EXAMPLE 6.7 Figure 6-15 shows the distribution of adult male weights for a particular age group. The
weights, X, are normally distributed with mean 170 pounds and standard deviation equal to 15 pounds. The
standardized value for a weight of 215 pounds is z = (x - μ)/σ = (215 - 170)/15 = 3. A man in this group who
weighs 215 pounds is 3 standard deviations above average. The standardized value for 170 pounds is zero, and
the standardized value for 125 pounds is -3.

[Fig. 6-15: Normal curve for adult male weights, with 125, 170, and 215 marked on the x axis]
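
The standardized values in Example 6.7 follow directly from formula (6.6), as this minimal Python sketch shows. The function name `z_score` is our own.

```python
def z_score(x, mean, sd):
    """Standardize x using formula (6.6): z = (x - mean) / sd."""
    return (x - mean) / sd


# Adult male weights: mean 170 pounds, standard deviation 15 pounds
for weight in (125, 170, 215):
    print(weight, z_score(weight, 170, 15))
```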


APPLICATIONS OF THE NORMAL DISTRIBUTION 


The fact that many real-world phenomena are normally distributed leads to numerous applications
of the normal distribution. Applications of the normal distribution usually involve finding
areas under a normal curve. To find the area between two values of x for a normal distribution, first 
convert both values of x to their respective z values. Then find the area under the standard normal 
curve between those two z values. The area between the two z values gives the area between the 
corresponding x values. 


EXAMPLE 6.8 In a study involving stress-induced blood pressure, volunteers played a computer game called 
the color-word interference task. The game was set so that everyone made errors about 17% of the time. The 
average increase in systolic blood pressure was 10 points, and the standard deviation was 3
points. Assuming the increases are normally distributed, the percent experiencing an increase of 16 points
or more is found by evaluating P(X > 16) and
multiplying by 100. The probability is shown as the shaded area in Fig. 6-16. The z value corresponding to x =
16 is z = (16 - 10)/3 = 2. The area shown in Fig. 6-16 is the same as the area shown in Fig. 6-17.

[Fig. 6-16: Normal curve with mean 10 and the area to the right of 16 shaded]

The equality of the areas in Figures 6-16 and 6-17 is expressible in terms of probability as follows:

P(X > 16) = P(Z > 2)


Using the standard normal distribution table, we find that P(Z > 2) = .5 - .4772 = .0228. The percent
experiencing an increase of 16 systolic points or more is 2.28%.

[Fig. 6-17: Standard normal curve with the area to the right of 2 shaded]


EXAMPLE 6.9 The time between release from prison and conviction for another crime for individuals under 
40 is normally distributed with a mean equal to 30 months and a standard deviation equal to 6 months. The 
percentage of these individuals convicted for another crime within two years of their release from prison is 
represented as P(X < 24) times 100. The probability is shown as the shaded area in Fig. 6-18. The event X < 24
is equivalent to Z = (X - 30)/6 < (24 - 30)/6 = -1. The probability P(Z < -1) is shown as the shaded area in Fig. 6-19.

[Fig. 6-18: Normal curve with mean 30 and the area to the left of 24 shaded]

The shaded area shown in Fig. 6-19 is P(Z < -1), and by symmetry P(Z < -1) = P(Z > 1). The probability
P(Z > 1) is found using the standard normal distribution table as .5 - .3413 = .1587. That is, P(X < 24) =
P(Z < -1) = .1587, or 15.87% are convicted of another crime within two years of their release.

[Fig. 6-19: Standard normal curve with the area to the left of -1 shaded]


EXAMPLE 6.10 A study determined that the difference between the price quoted to women and men for used 
cars is normally distributed with mean $400 and standard deviation $50. To clarify, let X be the amount quoted 
to a woman minus the amount quoted to a man for a given used car. Then, the population distribution for X is 
normal with μ = $400 and σ = $50. The percent of the time that quotes for used cars are $275 to $500 more for 
women than men is given by P(275 < X < 500) times 100. The probability P(275 < X < 500) is shown as the 
shaded area in Fig. 6-20. The z value corresponding to x = 275 is z = (275 - 400)/50 = -2.5 and the z value 
corresponding to x = 500 is z = (500 - 400)/50 = 2.0. The probability P(-2.5 < Z < 2.0) is shown in Fig. 6-21. 


[Fig. 6-20: normal curve with mean 400; the area between x = 275 and x = 500 is shaded.] 


The probability P(-2.5 < Z < 2.0) is found using the standard normal distribution table. The probability is 
expressed as P(-2.5 < Z < 2.0) = P(-2.5 < Z < 0) + P(0 < Z < 2.0). By symmetry, P(-2.5 < Z < 0) = 
P(0 < Z < 2.5) = .4938, and P(0 < Z < 2.0) = .4772. Therefore, P(-2.5 < Z < 2.0) = .4938 + .4772 = .971. 
That is, P(275 < X < 500) = P(-2.5 < Z < 2.0) = .971, or 97.1% of the time the quotes for used cars will be 
between $275 and $500 more for women than for men. 
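
The table lookups in Examples 6.8 through 6.10 can be cross-checked numerically. The following is a minimal sketch in Python, assuming the standard library class statistics.NormalDist is an acceptable stand-in for the standard normal distribution table. The helper name z_score and the variable names are hypothetical and are not part of the text. 

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_score(x, mu, sigma):
    """Standardize x for a normal distribution with mean mu and std. dev. sigma."""
    return (x - mu) / sigma


# Example 6.8: P(X > 16) with mu = 10, sigma = 3
p_68 = 1 - Z.cdf(z_score(16, 10, 3))

# Example 6.9: P(X < 24) with mu = 30, sigma = 6
p_69 = Z.cdf(z_score(24, 30, 6))

# Example 6.10: P(275 < X < 500) with mu = 400, sigma = 50
p_610 = Z.cdf(z_score(500, 400, 50)) - Z.cdf(z_score(275, 400, 50))

print(round(p_68, 4), round(p_69, 4), round(p_610, 4))
```

The three values should agree with the table-based answers .0228, .1587, and .971 to within rounding. 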


DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE 
NORMAL CURVE IS KNOWN 


In many applications, we are concerned with finding the z value or the x value when the area 
under a normal curve is known. The following examples illustrate the techniques for solving 
problems of this type. 


EXAMPLE 6.11 Find the positive value z such that the area under the standard normal curve between 0 and z 
is .4951. Figure 6-22 shows the area and the location of z on the horizontal axis. Table 6.3 gives the portion of 
the standard normal distribution table needed to find the z value. 


[Fig. 6-22: standard normal curve f(z); the area between 0 and z is shaded.] 


We search the interior of the table until we find .4951, the area we are given. This area appears in the row for 
2.5 and the column for .08 of Table 6.3. By going to the beginning of the row and top of the column in which 
.4951 resides, we see that the value for z is 2.58. 


Table 6.3 
  z     .00     .01    ...    .05    ...    .08 
 0.0   .0000   .0040   ...   .0199   ...   .0319 
 0.1   .0398   .0438   ... 
 0.2   .0793   .0832   ... 
 ... 
 2.5    ...     ...    ...    ...    ...   .4951 

To find the value of negative z such that the area under the standard normal curve between 0 and z equals 
.4951, we find the value for positive z as above and then take z to be -2.58. 


EXAMPLE 6.12 The mean number of passengers that fly per day is equal to 1.75 million and the standard 
deviation is 0.25 million per day. If the number of passengers flying per day is normally distributed, the 
distribution and ninety-fifth percentile, P95, are as shown in Fig. 6-23. The shaded area is equal to 0.95. We have 
a known area and need to find P95, a value of X. Since we must always use the standard normal tables to solve 
problems involving any normal curve, we first draw a standard normal curve corresponding to Fig. 6-23. This is 
shown in Fig. 6-24. 


[Fig. 6-23: normal curve f(x); the area to the left of P95 is shaded.] 


Using the technique shown in Example 6.11, we look for the area .4500 in the interior of the standard normal 
distribution table. It falls midway between the tabled areas .4495 (z = 1.64) and .4505 (z = 1.65), so we take 
the value of z to equal 1.645. There is 95% of the area under the standard normal curve to the left of 1.645 
and there is 95% of the area under the curve in Fig. 6-23 to the left of P95. Therefore if P95 is standardized, 
the standardized value must equal 1.645. That is, (P95 - 1.75)/.25 = 1.645, and solving for P95 we 
find P95 = 1.75 + .25 × 1.645 = 2.16 million. 


[Fig. 6-24: standard normal curve f(z); the area to the left of z = 1.645 is shaded.] 


The value of z in Fig. 6-24 can also be found by using INVCDF of Minitab as follows. 


MTB > invcdf .95; 

SUBC > normal 0 1. 

Inverse Cumulative Distribution Function 

Normal with mean = 0 and standard deviation = 1.00000 


P(X <= x) X 
0.9500 1.6449 
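
The inverse lookups in Examples 6.11 and 6.12 can also be sketched in Python. This is only an illustration under the assumption that statistics.NormalDist.inv_cdf from the standard library plays the role of Minitab's INVCDF; the variable names are hypothetical. 

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Example 6.11: an area of .4951 between 0 and z corresponds to a
# cumulative area of .5 + .4951 to the left of z.
z_611 = Z.inv_cdf(0.5 + 0.4951)

# Example 6.12: z value with 95% of the area to its left.
z_95 = Z.inv_cdf(0.95)

# The same percentile computed directly on the passenger distribution
# (mean 1.75 million, standard deviation 0.25 million).
p95 = NormalDist(mu=1.75, sigma=0.25).inv_cdf(0.95)

print(round(z_611, 2), round(z_95, 4), round(p95, 2))
```

Working directly with the distribution of X gives the same percentile as standardizing first and then solving for P95. 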
NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 


Figure 6-25 shows the binomial distribution for X, the number of heads to occur in 10 tosses of a 
coin. This distribution is shown as the shaded area under the histogram-shaped figure. Superimposed 
upon this binomial distribution is a normal curve. The mean of the binomial distribution is μ = np = 
10 × .5 = 5 and the standard deviation of the binomial distribution is σ = √(npq) = √(10 × .5 × .5) = 
√2.5 = 1.58. The normal curve, which is shown, has the same mean and standard deviation as the 
binomial distribution. When a normal curve is fit to a binomial distribution in this manner, this is 
called the normal approximation to the binomial distribution. The shaded area under the binomial 
distribution is equal to one and so is the total area under the normal curve. 

The normal approximation to the binomial distribution is appropriate whenever np ≥ 5 and 
nq ≥ 5. 


Fig. 6-25 


EXAMPLE 6.13 Using the table of binomial probabilities, the probability of 4 to 6 heads inclusive is as 
follows: P(4 ≤ X ≤ 6) = .2051 + .2461 + .2051 = .6563. This is the shaded area shown in Fig. 6-26. 


[Fig. 6-26: binomial distribution P(x); the rectangles at x = 4, 5, and 6 are shaded.] 


The normal approximation to this area is shown in Fig. 6-27. To account for the area of all three rectangles, note 
that X must go from 3.5 to 6.5. The area under the normal curve for X between 3.5 and 6.5 is found by 
determining the area under the standard normal curve for Z between z = (3.5 - 5)/1.58 = -.95 and 
z = (6.5 - 5)/1.58 = .95. This area is 2 × .3289 = .6578. Note that the approximation, .6578, is extremely 
close to the exact answer, .6563. 


[Fig. 6-27: normal curve f(x) over the range 0 to 10; the area between x = 3.5 and x = 6.5 is shaded.] 


EXAMPLE 6.14 To aid in the early detection of breast cancer, women are urged to perform a self-exam 
monthly. Thirty-eight percent of American women perform this exam monthly. In a sample of 315 women, the 
probability that 100 or fewer perform the exam monthly is P(X ≤ 100). Even Minitab has difficulty performing 
this extremely difficult computation involving binomial probabilities. However, this probability is quite easy to 
approximate using the normal approximation. The mean value for X is μ = np = 315 × .38 = 119.7 and the 
standard deviation is σ = √(npq) = √(315 × .38 × .62) = 8.61. A normal curve is constructed with mean 119.7 
and standard deviation 8.61. In order to cover all the area associated with the rectangle at X = 100 and all those 
less than 100, the normal curve area associated with x less than 100.5 is found. Figure 6-28 shows the normal 
curve area we need to find. 


[Fig. 6-28: normal curve f(x) with mean 119.7; the area to the left of x = 100.5 is shaded.] 


The corresponding area under the standard normal curve is shown in Fig. 6-29. The z value corresponding to 
x = 100.5 is z = (100.5 - 119.7)/8.61 = -2.23. The area under the standard normal curve for Z < -2.23 is 
.5000 - .4871 = .0129. The probability is extremely small that 100 or fewer in the 315 will perform the breast 
examination each month. 


[Fig. 6-29: standard normal curve f(z); the area to the left of z = -2.23 is shaded.] 
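
The comparison between exact binomial probabilities and the normal approximation in Examples 6.13 and 6.14 can be sketched in Python. This is a minimal illustration, not the method used in the text: the function names binom_cdf_exact and binom_cdf_normal are hypothetical, and the exact sum is evaluated term by term with math.comb from the standard library. 

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf_exact(k, n, p):
    """Exact P(X <= k) for a binomial(n, p) random variable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def binom_cdf_normal(k, n, p):
    """Normal approximation to P(X <= k), using the 0.5 continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


# Example 6.13: P(4 <= X <= 6) for n = 10, p = .5
exact_613 = binom_cdf_exact(6, 10, 0.5) - binom_cdf_exact(3, 10, 0.5)
approx_613 = binom_cdf_normal(6, 10, 0.5) - binom_cdf_normal(3, 10, 0.5)

# Example 6.14: P(X <= 100) for n = 315, p = .38
exact_614 = binom_cdf_exact(100, 315, 0.38)
approx_614 = binom_cdf_normal(100, 315, 0.38)

print(exact_613, approx_613)
print(exact_614, approx_614)
```

Because the sketch does not round z to two decimal places, its approximation for Example 6.13 is roughly .657 rather than the table-based .6578. 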
EXPONENTIAL PROBABILITY DISTRIBUTION 


The exponential probability distribution is a continuous probability distribution that is useful in 
describing the time it takes to complete some task. The pdf for an exponential probability distribution 
is given by formula (6.7), where μ is the mean of the probability distribution and e = 2.71828 to five 
decimal places. 


f(x) = (1/μ) e^(-x/μ) for x ≥ 0                                   (6.7) 


The graph for the pdf of a typical exponential distribution is shown in Fig. 6-30. 

[Fig. 6-30: graph of a typical exponential pdf f(x) against x.] 

EXAMPLE 6.15 The exponential random variable can be used to describe the following characteristics: the 
time between logins on the internet, the time between arrests for convicted felons, the lifetimes of electronic 
devices, and the shelf life of fat free chips. 


PROBABILITIES FOR THE EXPONENTIAL PROBABILITY DISTRIBUTION 


A standardized table of probabilities does not exist for the exponential distribution, and finding 
areas under the exponential distribution curve requires the use of calculus. However, formula (6.8) 
is useful in solving many problems involving the exponential distribution. 


P(X ≤ a) = 1 - e^(-a/μ)                                           (6.8) 

The area corresponding to the probability given in formula (6.8) is shown as the shaded area in Fig. 
6-31. 

[Fig. 6-31: exponential pdf f(x); the area to the left of x = a is shaded.] 




EXAMPLE 6.16 Suppose the time till death after infection with HIV, the AIDS virus, is exponentially 
distributed with mean equal to 8 years. If X represents the time till death after infection with HIV, then the 
percent who die within five years after infection with HIV is found by multiplying P(X < 5) by 100. The 
probability is found as follows: P(X < 5) = 1 - e^(-5/8) = 1 - .535 = .465. Using CDF of Minitab, we have the 
following as an alternative solution. 


MTB > cdf 5; 

SUBC > exponential 8. 
Cumulative Distribution Function 
Exponential with mean = 8.00000 


x P(X <= x) 
5.0000 0.4647 


To find the percent who live more than 10 years, we multiply P(X > 10) by 100. In order to utilize formula 
(6.8), we use the complementary rule for probabilities. This rule allows us to write P(X > 10) as follows: 


P(X > 10) = 1 - P(X ≤ 10) = 1 - (1 - e^(-10/8)) = e^(-1.25) = .287 


That is, 28.7% of the individuals live more than 10 years after infection. This probability is shown as the shaded 
area in Fig. 6-32. 


[Fig. 6-32: exponential pdf f(x); the area to the right of x = 10 is shaded.] 


To find the percent who live between 2 and 4 years after infection, we multiply P(2 < X < 4) by 100. To use 
formula (6.8) to find this probability, we express P(2 < X < 4) as follows: 


P(2 < X < 4) = P(X < 4) - P(X < 2) = (1 - e^(-.5)) - (1 - e^(-.25)) = e^(-.25) - e^(-.5) = .172 


That is, 17.2% live between 2 and 4 years after infection. This probability is shown as the shaded area in Fig. 
6-33. 

[Fig. 6-33: exponential pdf f(x); the area between x = 2 and x = 4 is shaded.] 
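
Formula (6.8) is simple enough to evaluate directly. The following Python sketch reproduces the three probabilities of Example 6.16 under the stated mean of 8 years; the function name expon_cdf and the variable names are hypothetical and are not taken from the text. 

```python
from math import exp


def expon_cdf(a, mean):
    """P(X <= a) for an exponential random variable with the given mean,
    following formula (6.8); the probability is 0 for a <= 0."""
    if a <= 0:
        return 0.0
    return 1 - exp(-a / mean)


mean_years = 8  # mean from Example 6.16

p_within_5 = expon_cdf(5, mean_years)                            # P(X < 5)
p_more_than_10 = 1 - expon_cdf(10, mean_years)                   # P(X > 10)
p_between_2_4 = expon_cdf(4, mean_years) - expon_cdf(2, mean_years)  # P(2 < X < 4)

print(round(p_within_5, 4), round(p_more_than_10, 4), round(p_between_2_4, 4))
```

The first value should match the Minitab output 0.4647, and the other two should round to .287 and .172. 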
Solved Problems 


UNIFORM PROBABILITY DISTRIBUTION 


6.1 The price for a gallon of whole milk is uniformly distributed between $2.25 and $2.75 during 
July in the U.S. Give the equation and graph the pdf for X, the price per gallon of whole milk 
during July. Also determine the percent of stores that charge more than $2.70 per gallon. 


Ans. The equation of the pdf is f(x) = 2 for 2.25 < x < 2.75, and f(x) = 0 elsewhere. The graph is shown 
in Fig. 6-34. 


[Fig. 6-34: graph of the uniform pdf, f(x) = 2 between x = 2.25 and x = 2.75.] 
[Fig. 6-35: the same pdf with the area between x = 2.70 and x = 2.75 shaded.] 


The percent of stores charging a higher price than $2.70 is P(X > 2.70) times 100. The probability 
P(X > 2.70) is the shaded area in Fig. 6-35. This area is 2 × .05 = .10. Ten percent of all milk 
outlets sell a gallon of milk for more than $2.70. 


6.2 The time between release from prison and the commission of another crime is uniformly 
distributed between 0 and 5 years for a high-risk group. Give the equation and graph the pdf 
for X, the time between release and the commission of another crime for this group. What 
percent of this group will commit another crime within two years of their release from prison? 


Ans. The equation of the pdf is f(x) = .2 for 0 < x < 5, and f(x) = 0 elsewhere. The graph of the pdf is 
shown in Fig. 6-36. The percent who commit another crime within two years is given by P(X < 2) times 100. 
This probability is shown as the shaded area in Fig. 6-37, and is equal to 2 × .2 = .4. Forty percent will 
commit another crime within two years. 


[Fig. 6-36: graph of the uniform pdf, f(x) = .2 between x = 0 and x = 5.] 
[Fig. 6-37: the same pdf with the area between x = 0 and x = 2 shaded.] 
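
Since areas under a uniform pdf are rectangles, the probabilities in problems 6.1 and 6.2 reduce to width times height. A minimal Python sketch, with the hypothetical helper name uniform_prob, is given below. 

```python
def uniform_prob(lo, hi, a, b):
    """P(lo < X < hi) for X uniformly distributed on (a, b).

    The interval (lo, hi) is clipped to (a, b), because the pdf is 0 elsewhere.
    """
    lo = max(lo, a)
    hi = min(hi, b)
    width = max(hi - lo, 0.0)
    height = 1 / (b - a)  # value of the pdf on (a, b)
    return width * height


# Problem 6.1: milk prices uniform on (2.25, 2.75); P(X > 2.70)
p_61 = uniform_prob(2.70, 2.75, 2.25, 2.75)

# Problem 6.2: time to next crime uniform on (0, 5) years; P(X < 2)
p_62 = uniform_prob(0, 2, 0, 5)

print(round(p_61, 2), round(p_62, 2))
```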
MEAN AND STANDARD DEVIATION FOR THE UNIFORM 
PROBABILITY DISTRIBUTION 


6.3 Find the mean and standard deviation of the milk prices in problem 6.1. What percent of the 
prices are within one standard deviation of the mean? 


Ans. The mean is given by μ = (a + b)/2 = (2.25 + 2.75)/2 = 2.50 and the standard deviation is given by 
σ = √((b - a)²/12) = √((2.75 - 2.25)²/12) = .144. The one standard deviation interval about the mean goes 
from 2.36 to 2.64 and the probability of the interval is (2.64 - 2.36) × 2 = .56. Fifty-six percent of 
the prices are within one standard deviation of the mean. 


6.4 Find the mean and standard deviation of the times between release from prison and the 
commission of another crime in problem 6.2. What percent of the times are within two 
standard deviations of the mean? 


Ans. The mean is given by μ = (a + b)/2 = (0 + 5)/2 = 2.50 and the standard deviation is given by 
σ = √((b - a)²/12) = √((5 - 0)²/12) = 1.44. A 2 standard deviation interval about the mean goes from 
-.38 to 5.38 and 100% of the times are within 2 standard deviations of the mean. 
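
The mean and standard deviation formulas used in problems 6.3 and 6.4 can be checked with a short Python sketch. The function names uniform_mean_sd and fraction_within are hypothetical. Note that the sketch works with unrounded values, so for problem 6.3 it gives about 57.7% rather than the 56% obtained above from the rounded endpoints 2.36 and 2.64. 

```python
from math import sqrt


def uniform_mean_sd(a, b):
    """Mean (a + b)/2 and standard deviation sqrt((b - a)^2 / 12) of a uniform(a, b)."""
    return (a + b) / 2, (b - a) / sqrt(12)


def fraction_within(k, a, b):
    """Fraction of a uniform(a, b) distribution within k standard deviations of the mean."""
    mean, sd = uniform_mean_sd(a, b)
    lo = max(a, mean - k * sd)  # clip to the support (a, b)
    hi = min(b, mean + k * sd)
    return (hi - lo) / (b - a)


mean_63, sd_63 = uniform_mean_sd(2.25, 2.75)   # problem 6.3
frac_63 = fraction_within(1, 2.25, 2.75)

mean_64, sd_64 = uniform_mean_sd(0, 5)         # problem 6.4
frac_64 = fraction_within(2, 0, 5)

print(mean_63, round(sd_63, 3), round(frac_63, 3))
print(mean_64, round(sd_64, 2), frac_64)
```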


NORMAL PROBABILITY DISTRIBUTION 


6.5 The mean net worth of all Hispanic individuals aged 51-61 in the U.S. is $80,000, and the 
standard deviation of the net worths of such individuals is $20,000. If the net worths are 
normally distributed, what percent have net worths between: (a) $60,000 and $100,000; (b) 
$40,000 and $120,000; (c) $20,000 and $140,000? 


Ans. (a) 68.26% have net worths between $60,000 and $100,000. 
(b) 95.44% have net worths between $40,000 and $120,000. 
(c) 99.72% have net worths between $20,000 and $140,000. 


6.6 If the median amount of money that parents in the age group 51-61 gave a child in the last 
year is $1,725 and the amount that parents in this age group give a child is normally 
distributed, what is the modal amount that parents in this age group give a child? 


Ans. Since the distribution is normal, the mean, median, and mode are all equal. 
Therefore, the modal amount is also $1,725. 


STANDARD NORMAL DISTRIBUTION 


6.7 Express the areas shown in the following two standard normal curves as a probability 
statement and find the area of each one. 


[Two standard normal curves f(z): on the left, the area between 0 and 1.83 is shaded; on the right, the area 
between -1.87 and 1.87 is shaded.] 


Ans. The area under the curve on the left is represented as P(0 < Z < 1.83) and from the standard 
normal distribution table is equal to .4664. The area under the curve on the right is represented as 
P(-1.87 < Z < 1.87) and from the standard normal distribution table is 2 × .4693 = .9386. 


6.8 Represent the following probabilities as shaded areas under the standard normal curve and 
explain in words how you find the areas: (a) P(Z < -1.75); (b) P(Z < 2.15). 


Ans. The probability in part (a) is shown as the shaded area in Fig. 6-38 and the probability in part (b) 
is shown in Fig. 6-39. 


To find the shaded area in Fig. 6-38, we note that P(Z < -1.75) = P(Z > 1.75) because of the 
symmetry of the normal curve. In addition, P(Z > 1.75) = .5 - P(0 < Z < 1.75) since the total area 
to the right of 0 is .5. From the standard normal distribution table, P(0 < Z < 1.75) is equal to 
.4599. Therefore, P(Z < -1.75) = P(Z > 1.75) = .5 - .4599 = .0401. 


To find the shaded area in Fig. 6-39, we note that P(Z < 2.15) = P(Z < 0) + P(0 < Z < 2.15). The 
probability P(Z < 0) = .5 because of the symmetry of the normal curve. From the standard normal 
distribution table, P(0 < Z < 2.15) = .4842. Therefore, P(Z < 2.15) = .5 + .4842 = .9842. The 
solution to part (b) using Minitab is as follows: 


MTB > cdf 2.15; 

SUBC > normal 0 1. 

Cumulative Distribution Function 

Normal with mean = 0 and standard deviation = 1.00000 


x P(X <= x) 
2.1500 0.9842 


[Fig. 6-38: standard normal curve f(z); the area to the left of z = -1.75 is shaded.] 


STANDARDIZING A NORMAL DISTRIBUTION 


6.9 The distribution of complaints per week per 100,000 passengers for all airlines in the U.S. is 
normally distributed with μ = 4.5 and σ = 0.8. Find the standardized values for the following 
observed values of the number of complaints per week per 100,000 passengers: (a) 6.3; (b) 2.5; 
(c) 4.5; (d) 8.0. 

Ans. (a) The standardized value for 6.3 is found by z = (x - μ)/σ = (6.3 - 4.5)/0.8 = 2.25 
(b) The standardized value for 2.5 is found by z = (x - μ)/σ = (2.5 - 4.5)/0.8 = -2.50 
(c) The standardized value for 4.5 is found by z = (x - μ)/σ = (4.5 - 4.5)/0.8 = 0.00 
(d) The standardized value for 8.0 is found by z = (x - μ)/σ = (8.0 - 4.5)/0.8 = 4.38 


6.10 Personal injury awards are normally distributed with a mean equal to $62,000 and a standard 
deviation equal to $13,500. Find the amount of the award corresponding to the following 
standardized values: (a) 2.0; (b) -3.0; (c) 0.0; (d) 4.5. 


Ans. 

(a) The amount corresponding to standardized value 2.0 is x = μ + zσ = 62,000 + 2.0 × 13,500 = 
89,000. 

(b) The amount corresponding to standardized value -3.0 is x = μ + zσ = 62,000 - 3.0 × 13,500 = 
21,500. 

(c) The amount corresponding to standardized value 0.0 is x = μ + zσ = 62,000 + 0.0 × 13,500 = 
62,000. 

(d) The amount corresponding to standardized value 4.5 is x = μ + zσ = 62,000 + 4.5 × 13,500 = 
122,750. 
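
The two conversions used in problems 6.9 and 6.10, z = (x - μ)/σ and x = μ + zσ, are inverses of one another. The Python sketch below illustrates this; the function names standardize and unstandardize are hypothetical. 

```python
def standardize(x, mu, sigma):
    """Convert an observed value x to its standardized value z = (x - mu) / sigma."""
    return (x - mu) / sigma


def unstandardize(z, mu, sigma):
    """Convert a standardized value z back to the original scale, x = mu + z * sigma."""
    return mu + z * sigma


# Problem 6.9: complaints per 100,000 passengers, mu = 4.5, sigma = 0.8
z_values = [standardize(x, 4.5, 0.8) for x in (6.3, 2.5, 4.5, 8.0)]

# Problem 6.10: personal injury awards, mu = 62,000, sigma = 13,500
awards = [unstandardize(z, 62_000, 13_500) for z in (2.0, -3.0, 0.0, 4.5)]

print(z_values)
print(awards)
```

The last standardized value in problem 6.9 is 4.375 before rounding, which the answer above reports as 4.38. 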


APPLICATIONS OF THE NORMAL DISTRIBUTION 


6.11 In a sociological study concerning family life, it is found that the age at first marriage for men 
is normally distributed with a mean equal to 23.7 years and a standard deviation equal to 3.5 
years. Determine the percent of men for whom the age at first marriage is between 20 and 30 
years of age. If X represents the age at first marriage for men, draw a normal curve for X and 
show the shaded area for P(20 < X < 30) as well as the corresponding area under the standard 
normal curve. 


Ans. The distribution for X is shown in Fig. 6-40. The shaded area represents P(20 < X < 30). The z 
value corresponding to x = 20 is z = (20 - 23.7)/3.5 = -1.06 and the z value corresponding to x = 30 is 
z = (30 - 23.7)/3.5 = 1.80. The area shown in Fig. 6-41 is equal to the area under the curve shown in 
Fig. 6-40, that is, P(20 < X < 30) = P(-1.06 < Z < 1.80). Utilizing the standard normal distribution 
table, P(-1.06 < Z < 1.80) = .3554 + .4641 = .8195. That is, about 82% of the first marriages for 
men occur for men between 20 and 30 years of age. 


[Fig. 6-40: normal curve f(x) with mean 23.7; the area between x = 20 and x = 30 is shaded.] 
[Fig. 6-41: standard normal curve f(z); the area between z = -1.06 and z = 1.80 is shaded.] 


6.12 The net worth of senior citizens is normally distributed with mean equal to $225,000 and 
standard deviation equal to $35,000. What percent of senior citizens have a net worth less than 
$300,000? 


Ans. Let X represent the net worth of senior citizens in thousands of dollars. The percent of senior 
citizens with a net worth less than $300,000 is found by multiplying P(X < 300) by 100. The 
probability P(X < 300) is shown in Fig. 6-42. The event X < 300 is equivalent to the event 
Z < (300 - 225)/35 = 2.14. The probability that Z < 2.14 is represented as the shaded area in Fig. 6-43. 
The probability that Z is less than 2.14 is found by adding P(0 < Z < 2.14) to .5, which equals .5 + 
.4838 = .9838. We can conclude that 98.38% of the senior citizens have net worths less than 
$300,000. 


[Fig. 6-42: normal curve f(x); the area to the left of x = 300 is shaded.] 
[Fig. 6-43: standard normal curve; the area to the left of z = 2.14 is shaded.] 


DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE 
NORMAL CURVE IS KNOWN 


6.13 Find the value for positive number a such that P(-a < Z < a) = .95. 


Ans. The symmetry of the normal curve implies that P(0 < Z < a) = .4750. This area is found in the 
interior of the standard normal distribution table and the z value corresponding to this area is 1.96. 
The area under the standard normal curve corresponding to -1.96 < Z < 1.96 is .9500. 


6.14 U.S. divorce rates, by county, are normally distributed with mean value equal to 4.5 per 1000 
people and standard deviation equal to 1.3 per 1000 people. Find the third quartile for divorce 
rates by county per 1000 people. Draw graphs to illustrate your solution. 


Ans. The divorce rate is represented by X and its distribution is shown in Fig. 6-44. The third quartile is 
represented by Q3. The shaded area is equal to .7500. The third quartile for the standard normal 
distribution is .67 and is shown in Fig. 6-45. Summarizing, we have P(X < Q3) = P(Z < .67) = .75. 
The standardized value of Q3 must equal .67. That is, (Q3 - 4.5)/1.3 = .67 or Q3 = 4.5 + 1.3 × .67 = 
5.37. Seventy-five percent of the divorce rates are 5.37 or below. 


[Fig. 6-44: normal curve f(x); the area to the left of Q3 is shaded.] 
[Fig. 6-45: standard normal curve f(z); the area to the left of z = .67 is shaded.] 


NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 


6.15 Use Minitab to find the probability of the event that between 8 and 11 heads inclusive occur 
when a coin is flipped 20 times. Find the normal approximation to this probability. 


Ans.  MTB > cdf 11;                          MTB > cdf 7; 
      SUBC > binomial 20 .5.                 SUBC > binomial 20 .5. 
      Cumulative Distribution Function       Cumulative Distribution Function 
      Binomial with n = 20 and p = 0.500     Binomial with n = 20 and p = 0.500 
          x     P(X <= x)                        x     P(X <= x) 
      11.00       0.7483                      7.00       0.1316 


To find the probability of the event 8 ≤ X ≤ 11, note that P(8 ≤ X ≤ 11) = P(X ≤ 11) - P(X ≤ 7). 
Therefore, P(8 ≤ X ≤ 11) = .7483 - .1316 = .6167. 


The mean of the binomial distribution is μ = np = 20 × .5 = 10, and the standard deviation is σ = 
√(npq) = √5 = 2.24. A normal curve with mean equal to 10 and standard deviation 2.24 is fit to the 
binomial distribution and the area under this normal curve is found for X ranging between 7.5 and 
11.5. This area is shown in Fig. 6-46. The standardized value for x = 7.5 is z = -1.12 and the 
standardized value for x = 11.5 is z = .67. The area between z = -1.12 and z = .67 is shown in Fig. 
6-47. The area is now found using Minitab. 


      MTB > cdf 11.5;                        MTB > cdf 7.5; 
      SUBC > normal 10 2.24.                 SUBC > normal 10 2.24. 
      Cumulative Distribution Function       Cumulative Distribution Function 
      Normal with mean = 10.00 and           Normal with mean = 10.00 and 
      standard deviation = 2.240             standard deviation = 2.240 
          x     P(X <= x)                        x     P(X <= x) 
      11.50       0.7485                      7.50       0.1322 

The area between 7.5 and 11.5 is the difference .7485 - .1322 = .6163. 


[Fig. 6-46: normal curve f(x); the area between x = 7.5 and x = 11.5 is shaded.] 
[Fig. 6-47: standard normal curve f(z); the area between z = -1.12 and z = .67 is shaded.] 


6.16 In the United States, 80% of TV-owning homes also own a VCR. Use the normal approximation 
to the binomial distribution to find the probability that 425 or more in a sample of 500 
TV-owning homes also own a VCR. 


Ans. If we let X represent the number of TV-owning homes which also own a VCR, then we are to find 
P(X ≥ 425). A normal curve having mean = np = 400 and standard deviation = √(npq) = 8.94 is fit 
to the binomial and we need to find the area under the normal curve to the right of 424.5. The 
solution using Minitab is as follows. 


MTB > cdf 424.5; 

SUBC > normal 400 8.94. 

Cumulative Distribution Function 

Normal with mean = 400.00 and standard deviation = 8.94 
X P(X <= x) 

424.50 0.9969 


Since .9969 is the probability that X is less than 424.5, the probability that X exceeds 424.5 is 
1 - .9969 = .0031. 


EXPONENTIAL PROBABILITY DISTRIBUTION 


6.17 Which of the following statements best describes the exponential probability distribution? 
(a) The exponential probability distribution is skewed to the right. 
(b) The exponential probability distribution is skewed to the left. 
(c) The exponential probability distribution is mound shaped. 
Ans. (a) The exponential distribution is skewed to the right. 


6.18 What is the equation of the exponential pdf having mean equal to 5? 


Ans. The equation is f(x) = .2e^(-.2x) for x ≥ 0. 


PROBABILITIES FOR THE EXPONENTIAL PROBABILITY DISTRIBUTION 


6.19 The lifetimes in years for a particular brand of cathode ray tube are exponentially distributed 
with a mean of 5 years. What percent of the tubes have lifetimes between 5 and 8 years? Draw 
a graph of the pdf and shade the area which represents the probability of the event 5 < X < 8, 
where X represents the lifetimes. 


Ans. The shaded area in Fig. 6-48 represents P(5 < X < 8). The probability is found as follows: 
P(5 < X < 8) = P(X < 8) - P(X < 5) = (1 - e^(-1.6)) - (1 - e^(-1)) = e^(-1) - e^(-1.6) = .3679 - .2019 = .166. 
Approximately 16.6% of the tubes have lifetimes between 5 and 8 years. 


[Fig. 6-48: exponential pdf f(x); the area between x = 5 and x = 8 is shaded.] 


6.20 For the cathode ray tubes in problem 6.19, use Minitab to determine the percent of tubes that 
have lifetimes of less than 10 years. 


Ans. MTB > cdf 10; 
SUBC > exponential 5. 
Cumulative Distribution Function 
Exponential with mean = 5.00000 
X P(X <= x) 
10.00 0.8647 


The percentage of tubes having lifetimes less than 10 years is 86.5%. 


Supplementary Problems 


UNIFORM PROBABILITY DISTRIBUTION 


6.21 The RND function is a computer language function which uniformly generates random numbers between 
0 and 1. 

(a) What percent of the random numbers generated by RND are less than .35? 

(b) What percent of the random numbers generated by RND are between .20 and .55? 

(c) What percent of the random numbers generated by RND are either less than .14 or greater than .81? 


Ans. (a) 35% (b) 35% (c) 33% 


6.22 In a psychological study involving personality types and career selections, it is found that the time 
required to complete a task is uniformly distributed over the interval from 5.0 to 7.5 minutes. 

(a) What is the probability that the task is completed in less than 4 minutes? 

(b) What is the probability that the task is completed in 7.0 or more minutes? 

(c) What is the probability that it requires more than 10.0 minutes to complete the task? 


Ans. (a) 0 (b) .2 (c) 0 


MEAN AND STANDARD DEVIATION FOR THE UNIFORM PROBABILITY DISTRIBUTION 

6.23 What are the mean and standard deviation of the random numbers generated by RND in problem 6.21? 

Ans. μ = .50 σ = .29 

6.24 What percent of the distribution in problem 6.23 is included within one standard deviation of the mean? 


Ans. 57.6% 


NORMAL PROBABILITY DISTRIBUTION 


6.25 The heights of adult males are normally distributed with a mean equal to 70 inches and a standard 
deviation equal to 3 inches. Give the pdf for the normal curve describing this distribution. 


Ans. f(x) = (1/(3√(2π))) e^(-(x - 70)²/18) for all values of x 


6.26 A normal distribution has mean equal to a and standard deviation equal to b and c is a positive number. If 
it is known that P(X > a + c) = d, find the following probabilities: 
(a) P(X < a + c) (b) P(X < a - c) (c) P(a - c < X < a + c) (d) P(a < X < a + c) 


Ans. (a) 1 - d (b) d (c) 1 - 2d (d) .5 - d 


STANDARD NORMAL DISTRIBUTION 


6.27 Find the following probabilities concerning the standard normal random variable Z. 
(a) P(0 < Z < 2.13) (b) P(-1.45 < Z < 2.10) (c) P(Z > 2.88) (d) P(Z > -2.01) 


Ans. (a) .4834 (b) .9086 (c) .0020 (d) .9778 

6.28 Find the probabilities of the following events involving the standard normal random variable Z. 
(a) Z > 4.50 (b) -4.00 < Z < 5.50 (c) Z < -1.25 or Z > 2.35 
(d) Z < 1.15 and Z > -1.15 (e) Z < -2.45 and Z > 1.11 (f) the complement of Z < 1.44 


Ans. (a) approx 0 (b) approx 1 (c) .1150 (d) .7498 (e) 0 (f) .0749 


STANDARDIZING A NORMAL DISTRIBUTION 


6.29 The marriage rate per 1,000 population per county has a normal distribution with mean 8.9 and standard 
deviation 1.7. Find the standardized values for the following marriage rates per 1,000 per county. 
(a) 6.5 (b) 8.8 (c) 12.5 (d) 13.5 


Ans. (a) -1.41 (b) -0.06 (c) 2.12 (d) 2.71 

6.30 The hospital cost for individuals involved in accidents who do not wear seat belts is normally distributed 
with mean $7,500 and standard deviation $1,200. 
(a) Find the cost for an individual whose standardized value is 2.5. 


(b) Find the cost for an individual whose bill is 3 standard deviations below the average. 


Ans. (a) $10,500 (b) $3,900 

APPLICATIONS OF THE NORMAL DISTRIBUTION 


6.31 The average TV-viewing time per week for children ages 2 to 11 is 22.5 hours and the standard deviation 
is 5.5 hours. Assuming the viewing times are normally distributed, find the following. 
(a) What percent of the children have viewing times less than 10 hours per week? 
(b) What percent of the children have viewing times between 15 and 25 hours per week? 
(c) What percent of the children have viewing times greater than 40 hours per week? 


Ans. (a) 1.16% (b) 58.67% (c) less than 0.1% 


6.32 The amount that airlines spend on food per passenger is normally distributed with mean $8.00 and 
standard deviation $2.00. 
(a) What percent spend less than $5.00 per passenger? 
(b) What percent spend between $6.00 and $10.00? 
(c) What percent spend more than $12.50? 


Ans. (a) 6.68% (b) 68.26% (c) 1.22% 


DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE NORMAL 
CURVE IS KNOWN 


6.33 Find the value of a in each of the following probability statements involving the standard normal variable 
Z: 
(a) P(0 < Z < a) = .4616 (b) P(Z < a) = .8980 (c) P(-a < Z < a) = .8612 
(d) P(Z < a) = .1894 (e) P(Z > a) = .1894 (f) P(Z = a) = .5000 


Ans. (a) 1.77 (b) 1.27 (c) 1.48 (d) -0.88 (e) 0.88 (f) no solutions 


6.34 The GMAT test is required for admission to most graduate programs in business. In a recent year, the 
GMAT test scores were normally distributed with mean value 550 and standard deviation 100. 
(a) Find the first quartile for the distribution of GMAT test scores. 
(b) Find the median for the distribution of GMAT test scores. 
(c) Find the ninety-fifth percentile for the distribution of GMAT test scores. 


Ans. (a) 483 (b) 550 (c) 715 


NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 


6.35 For which of the following binomial distributions is the normal approximation to the binomial 
distribution appropriate? 
(a) n = 15, p = .2 (b) n = 40, p = .1 (c) n = 500, p = .05 (d) n = 50, p = .3 


Ans. (c) and (d) 


6.36 Thirteen percent of students took a college remedial course in 1992-1993. Assuming this is still true, 
what is the probability that in 350 randomly selected students: 
(a) Less than 40 take a remedial course 
(b) Between 40 and 50, inclusive, take a remedial course 
(c) More than 55 take a remedial course 


Ans. (a) .1711 (b) .6141 (c) .0559 

EXPONENTIAL PROBABILITY DISTRIBUTION 


6.37 The random variable X has an exponential distribution with pdf f(x) = .1e^(-.1x), for x > 0. What is the mean 
for this random variable? 


Ans. 10 

6.38 What is the largest value an exponential random variable may assume? 


Ans. There is no upper limit for an exponential distribution. 


PROBABILITIES FOR THE EXPONENTIAL PROBABILITY DISTRIBUTION 

6.39 The time that a family lives in a home between purchase and resale is exponentially distributed with a 
mean equal to 5 years. Let X represent the time between purchase and resale. (a) Find P(X < 3). (b) Find 
P(X > 5). 
Ans. (a) .4512  (b) .3679 


6.40 The times between orders received at a mail order company are exponentially distributed with a mean 
equal to 0.5 hour. What is the probability that the time between orders is between 1 and 2 hours? 


Ans. .1170 


Chapter 7 


Sampling Distributions 


SIMPLE RANDOM SAMPLING 


In order to obtain information about some population, either a census of the whole population is 
taken or a sample is chosen from the population and the information is inferred from the sample. 
The second approach is usually taken, since it is much cheaper to obtain a sample than to conduct a 
census. In choosing a sample, it is desirable to obtain one that is representative of the population. 
The average weight of the football players at a college would not be a representative estimate of the 
average weight of students attending the college, for example. A simple random sample of size n 
from a population of size N is one selected in such a way that every sample of size n has the same 
chance of occurring. In simple random sampling with replacement, a member of the population can 
be selected more than once. In simple random sampling without replacement, a member of the 
population can be selected at most once. Simple random sampling without replacement is the most 
common type of simple random sampling. 


EXAMPLE 7.1 Consider the population consisting of the world's five busiest airports. This population
consists of the following: A: Chicago O'Hare, B: Atlanta Hartsfield, C: London Heathrow, D: Dallas-Fort
Worth, and E: Los Angeles Intl. The number of possible samples of size 2 from this population of size 5 is given
by the combination of 5 items selected two at a time, that is, \(\binom{5}{2} = \frac{5!}{2!\,3!} = 10\). In simple
random sampling, each possible pair would have probability 0.1 of being the pair selected. That is, Chicago
O'Hare and Atlanta Hartsfield would have probability 0.1 of being chosen, Chicago O'Hare and London Heathrow
would have probability 0.1 of being chosen, etc. One way of ensuring that each pair would have an equal chance
of being selected would be to write the names of the five airports on separate sheets of paper and select two
of the sheets randomly from a box.
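The count in Example 7.1 can be checked by listing the samples directly. The following is a minimal Python sketch, not part of the original text; the variable names are illustrative only.

```python
from itertools import combinations

# The five airports from Example 7.1, keyed by the letters used in the text.
airports = {
    "A": "Chicago O'Hare",
    "B": "Atlanta Hartsfield",
    "C": "London Heathrow",
    "D": "Dallas-Fort Worth",
    "E": "Los Angeles Intl.",
}

# Every unordered pair of distinct airports is one possible sample of size 2.
samples = list(combinations(sorted(airports), 2))

# Under simple random sampling each possible sample is equally likely.
probability_each = 1 / len(samples)

print(len(samples))       # number of possible samples of size 2
print(probability_each)   # chance that any one particular pair is selected
```

Listing the pairs rather than only counting them makes it easy to see that each airport appears in the same number of samples.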


USING RANDOM NUMBER TABLES 


The technique of writing names on slips of paper and selecting them from a box is not practical 
for most real world situations. Tables of random numbers are available in a variety of sources. The 
digits 0 through 9 occur randomly throughout a random number table with each digit having an equal 
chance of occurring. Table 7.1 is an example of a random number table. This particular table has 50
columns and 20 rows. To use a random number table, first randomly select a starting position and 
then move in any direction to select the numbers. 


EXAMPLE 7.2 The money section of USA Today gives the 1,900 most active New York Stock Exchange
issues. The random numbers in Table 7.1 can be used to randomly select 10 of these issues. Imagine that the
issues are numbered from 0001 to 1900. Suppose we randomly decide to start in row 1 and columns 21 through
24. The four-digit number located here is 0345. Reading down these four columns and discarding any number
exceeding 1900, we obtain the following eight random numbers between 0001 and 1900: 0345, 1304, 0990,
1580, 1461, 1064, 0676, and 0347. To obtain our other two numbers, we proceed to row 1 and columns 26
through 29. Reading down these columns, we find 1149 and 1074. The 10 stock issues selected are the ones
located in positions 345, 347, 676, 990, 1064, 1074, 1149, 1304, 1461, and 1580 of the listing.
Table 7.1


Random Numbers (20 rows, 50 columns)
01-05 06-10 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50
87032 26561 44020 06061 0345? 22484 55858 61768 12676 29353
98340 94192 81975 69931 13047 28533 34529 02625 11020 65106
70363 95651 64089 31921 09900 81554 53640 92109 00459 09599
09749 91862 12659 63079 91937 58272 90766 09950 27996 29679
84223 87730 91759 9304 89477 84221 38566 496274 29195


27190 92922 86046 09124 42493 26551 76639 15763 18068 38998
47087 68993 24807 70755 36834 19522 59510 83888 51540 39119
47335 69753 44311 33070 15800 92668 78460 39356 91692 34824
31031 52240 10346 02433 42534 84923 44548 Tae22 89959 92119
60072 37318 07550 04411 43925 11499 93024 72791 60190 29692


90202 45248 84967 67293 14612 99573 69573 98695 51303 44925
91887 83092 39204 23539 98551 48427 25425 43864 10714 08308
08264 04860 05919 28393 21460 28370 43026 78296 58382 08276
46655 67610 35334 44369 10649 10744 50515 01372 55081 34421
30428 33957 53553 22925 06766 37433 45349 46565 47011 46762


55238 40718 83328 97613 77718 16016 58590 03726 03091
64993 84882 03067 19953 21077 27665 10583 62587 36875 00638
90420 80152 10418 26576 40361 82421 61952 62713 04890 01032
44621 76402 04778 58739 03474 00570 28368 60340 95227 39059
15988 94013 71898 05785 17772 57471 75775 95202 06545


USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE 


Most computer statistical software packages can be used to select random numbers and to some 
extent have replaced random number tables. As the capability and availability of computers continue 
to increase, many of the statistical tables are becoming obsolete. 


EXAMPLE 7.3 Minitab can be used to select the random sample of stock issues in Example 7.2. The 
commands are as follows. 


MTB > set c1
DATA > 1:1900
DATA > end
MTB > sample 10 c1 c2
MTB > print c2


Data Display 
C2 
1227 969 1834 1441 423 897 824 664 414 77 


The first three lines of command put the numbers 1 through 1900 into column c1. The command on the fourth
line asks for a sample of size 10 from the numbers in column c1 and asks that the selected numbers be placed
into column c2. The print command causes the random numbers to be printed. These are the numbers of the 10
stocks to be selected.
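For readers without Minitab, the same selection can be sketched in Python with the standard library. This is an illustrative equivalent, not the text's method; the function name and the `seed` parameter are assumptions introduced here, and the numbers drawn will differ from the Minitab output above.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct numbers from 1..population_size without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable; omit it for a fresh sample
    return rng.sample(range(1, population_size + 1), sample_size)

# Select 10 of the 1,900 stock issues, as in Example 7.3.
stock_numbers = simple_random_sample(1900, 10, seed=1)
print(sorted(stock_numbers))
```

`random.sample` draws without replacement, so no issue number can appear twice.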
SYSTEMATIC RANDOM SAMPLING


Systematic random sampling consists of choosing a sample by randomly selecting the first 
element and then selecting every kth element thereafter. The systematic method of selecting a sample 
often saves time in selecting the sample units. 


EXAMPLE 7.4 In order to obtain a systematic sample of 50 of the nation’s 3,143 counties, divide 3,143 by 50 
to obtain 62.86. Round 62.86 down to obtain 62. From a list of the 3,143 counties, select one of the first 62 
counties at random. Suppose county number 35 is selected. To obtain the other 49 counties, add 62 to 35 to 
obtain 97, add 2 x 62 to 35 to obtain 159, and continue in this fashion until the number 49 x 62 + 35 = 3,073 is 
obtained. The counties numbered 35, 97, 159, ... , 3073 would represent a systematic sample from the nation's
counties.
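The arithmetic in Example 7.4 can be expressed as a short function. The sketch below is illustrative; the function name and its `start` and `seed` parameters are assumptions, not part of the text.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the positions chosen by systematic sampling.

    The interval k is population_size / sample_size rounded down. The first
    position is drawn at random from 1..k unless start is supplied.
    """
    k = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]

# Example 7.4: 50 of 3,143 counties with a random start of 35.
counties = systematic_sample(3143, 50, start=35)
print(counties[:3], counties[-1])
```

Because the interval is rounded down, the last few items in the list (here, counties 3,074 through 3,143) can never be selected once the start is restricted to the first k positions.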


CLUSTER SAMPLING 


In cluster sampling, the population is divided into clusters and then a random sample of the
clusters is selected. The selected clusters may be completely sampled or a random sample may be
obtained from the selected clusters. 


EXAMPLE 7.5 A large company has 30 plants located throughout the United States. In order to assess a new
total quality plan, the 30 plants are considered to be clusters and five of the plants are randomly selected. All of 
the quality control personnel at the five selected plants are asked to evaluate the total quality plan. 


STRATIFIED SAMPLING 


In stratified sampling, the population is divided into strata and then a random sample is selected 
from each stratum. The strata may be determined by income levels, different stores in a supermarket
chain, different age groups, different governmental law enforcement agencies, and so forth. 


EXAMPLE 7.6 Super Value Discount has 10 stores. To assess job satisfaction, one percent of the employees
at each of the 10 stores are administered a job satisfaction questionnaire. The 10 stores are the strata into which
the population of all employees at Super Value Discount is divided. The results at the 10 stores are combined
to evaluate the job satisfaction of the employees. 
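The difference between cluster and stratified sampling can be made concrete with a small sketch. Everything below is hypothetical illustration: the plant and store labels, the figure of 300 employees per store, and the function names are assumptions introduced here and do not come from Examples 7.5 or 7.6.

```python
import random

def cluster_sample(cluster_names, n_clusters, seed=None):
    """Randomly choose whole clusters; units inside a chosen cluster are then surveyed."""
    return random.Random(seed).sample(list(cluster_names), n_clusters)

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, members in strata.items():
        size = max(1, round(fraction * len(members)))  # at least one unit per stratum
        chosen[name] = rng.sample(members, size)
    return chosen

# Cluster sampling in the spirit of Example 7.5: pick 5 of 30 plants.
plants = [f"plant_{i}" for i in range(1, 31)]
selected_plants = cluster_sample(plants, 5, seed=0)

# Stratified sampling in the spirit of Example 7.6: 1% of employees in each of 10 stores.
# ASSUMPTION: 300 employees per store, purely for illustration.
stores = {f"store_{s}": [f"s{s}_e{e}" for e in range(1, 301)] for s in range(1, 11)}
surveyed = stratified_sample(stores, 0.01, seed=0)

print(selected_plants)
print({store: len(people) for store, people in surveyed.items()})
```

Note the structural difference: the cluster sample leaves most plants entirely unobserved, while the stratified sample includes some employees from every store.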


SAMPLING DISTRIBUTION OF THE SAMPLE MEAN


The mean of a population, \(\mu\), is a parameter that is often of interest, but usually the value of \(\mu\) is
unknown. In order to obtain information about the population mean, a sample is taken and the sample
mean, \(\bar{x}\), is calculated. The value of the sample mean is determined by the sample actually selected.
The sample mean can assume several different values, whereas the population mean is constant. The
set of all possible values of the sample mean along with the probabilities of occurrence of the
possible values is called the sampling distribution of the sample mean. The following example will
help illustrate the sampling distribution of the sample mean.


EXAMPLE 7.7 Suppose the five cities with the most African-American-owned businesses, measured in
thousands, are given in Table 7.2.

Table 7.2

City                    Number of African-American-owned businesses, in thousands
A: New York             42
B: Washington, D.C.     39
C: Los Angeles          36
D: Chicago              33
E: Atlanta              30


If X represents the number of African-American-owned businesses in thousands for this population consisting of
five cities, then the probability distribution for X is shown in Table 7.3.


Table 7.3

x       30    33    36    39    42
P(x)    .2    .2    .2    .2    .2


The population mean is \(\mu = \sum x P(x) = 30 \times .2 + 33 \times .2 + 36 \times .2 + 39 \times .2 + 42 \times .2 = 36\),
and the variance is given by
\(\sigma^2 = \sum x^2 P(x) - \mu^2 = 900 \times .2 + 1089 \times .2 + 1296 \times .2 + 1521 \times .2 + 1764 \times .2 - 1296 = 1314 - 1296 = 18\).
The population standard deviation is the square root of 18, or 4.24.


The number of samples of size 3 possible from this population is equal to the number of combinations
possible when selecting three cities from five. The number of possible samples is
\(\binom{5}{3} = \frac{5!}{3!\,2!} = 10\). Using the letters A, B, C, D, and E rather than the names of the cities,
Table 7.4 gives all the possible samples of three cities, the sample values, and the means of the samples.


Table 7.4

Sample      Number of businesses in the sample    Sample mean
A, B, C     42, 39, 36                            39
A, B, D     42, 39, 33                            38
A, B, E     42, 39, 30                            37
A, C, D     42, 36, 33                            37
A, C, E     42, 36, 30                            36
A, D, E     42, 33, 30                            35
B, C, D     39, 36, 33                            36
B, C, E     39, 36, 30                            35
B, D, E     39, 33, 30                            34
C, D, E     36, 33, 30                            33

The sampling distribution of the mean is obtained from Table 7.4. For random sampling, each of the
samples in Table 7.4 is equally likely to be selected. The probability of selecting a sample with mean 39 is .1,
since only one of 10 samples has a mean of 39. The probability of selecting a sample with mean 36 is .2, since
two of the samples have a mean equal to 36. Table 7.5 gives the sampling distribution of the sample mean.


Table 7.5

\(\bar{x}\)       33    34    35    36    37    38    39
\(P(\bar{x})\)    .1    .1    .2    .2    .2    .1    .1
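The construction in Example 7.7 (list every sample, compute each mean, tally the means) is mechanical and can be sketched in Python. The function name below is an illustrative assumption; exact fractions are used so the probabilities are not blurred by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Population from Table 7.2 (businesses in thousands), keyed by city letter.
businesses = {"A": 42, "B": 39, "C": 36, "D": 33, "E": 30}

def sampling_distribution_of_mean(values, n):
    """Map each possible sample mean to its probability under simple random
    sampling without replacement, where every sample of size n is equally likely."""
    samples = list(combinations(values, n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

dist = sampling_distribution_of_mean(list(businesses.values()), 3)
for mean, prob in dist.items():
    print(float(mean), float(prob))
```

Calling the same function with `n=4` gives the distribution for samples of size 4 from the same population.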
SAMPLING ERROR


When the sample mean is used to estimate the population mean, an error is usually made. This
error is called the sampling error, and is defined to be the absolute difference between the sample
mean and the population mean. The sampling error is defined by


\[ \text{sampling error} = |\bar{x} - \mu| \qquad (7.1) \]


EXAMPLE 7.8 In Example 7.7, the population of the five cities with the most African-American-owned
businesses is given. The mean of this population is 36. Table 7.5 gives the possible sample means for all
samples of size 3 selected from this population. Table 7.6 gives the sampling errors and probabilities associated
with all the different sample means.


Table 7.6

Sample mean       33    34    35    36    37    38    39
Sampling error     3     2     1     0     1     2     3
Probability       .1    .1    .2    .2    .2    .1    .1


From Table 7.6, it is seen that the probability of no sampling error in this scenario is .20. There is a 60% chance
that the sampling error is 1 or less.


MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN 


Since the sample mean has a distribution, it is a random variable and has a mean and a standard
deviation. The mean of the sample mean is represented by the symbol \(\mu_{\bar{x}}\), and the standard deviation
of the sample mean is represented by \(\sigma_{\bar{x}}\). The standard deviation of the sample mean is referred to as
the standard error of the mean. Example 7.9 illustrates how to find the mean of the sample mean and
the standard error of the mean.


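EXAMPLE 7.9 In Example 7.7, the sampling distribution of the mean shown in Table 7.7 was obtained.


Table 7.7

\(\bar{x}\)       33    34    35    36    37    38    39
\(P(\bar{x})\)    .1    .1    .2    .2    .2    .1    .1


The mean of the sample mean is found as follows:
\[ \mu_{\bar{x}} = \sum \bar{x} P(\bar{x}) = 33 \times .1 + 34 \times .1 + 35 \times .2 + 36 \times .2 + 37 \times .2 + 38 \times .1 + 39 \times .1 = 36 \]
The variance of the sample mean is found as follows:
\[ \sigma_{\bar{x}}^2 = \sum \bar{x}^2 P(\bar{x}) - \mu_{\bar{x}}^2 \]
\[ \sigma_{\bar{x}}^2 = 1089 \times .1 + 1156 \times .1 + 1225 \times .2 + 1296 \times .2 + 1369 \times .2 + 1444 \times .1 + 1521 \times .1 - 1296 = 3 \]


The standard error of the mean, \(\sigma_{\bar{x}}\), is equal to \(\sqrt{3} = 1.73\).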
The relationship between the mean of the sample mean and the population mean is expressed by
\[ \mu_{\bar{x}} = \mu \qquad (7.2) \]


The relationship between the variance of the sample mean and the population variance is
expressed by formula (7.3), where N is the population size and n is the sample size.


\[ \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \times \frac{N-n}{N-1} \qquad (7.3) \]


EXAMPLE 7.10 In Example 7.7, the population consisting of the five cities with the most African-American-
owned businesses was introduced. The population mean, \(\mu\), is equal to 36 and the variance, \(\sigma^2\), is equal to 18.
In Example 7.9, the mean of the sample mean, \(\mu_{\bar{x}}\), was shown to equal 36 and the variance of the sample
mean, \(\sigma_{\bar{x}}^2\), was shown to equal 3. It is seen that \(\mu = \mu_{\bar{x}} = 36\), illustrating formula (7.2). To illustrate
formula (7.3), note that


\[ \frac{\sigma^2}{n} \times \frac{N-n}{N-1} = \frac{18}{3} \times \frac{5-3}{5-1} = 3 = \sigma_{\bar{x}}^2 \]
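Formulas (7.2) and (7.3) can also be checked numerically by brute force for the five-city population. The sketch below is illustrative and uses exact fractions; the variable names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import combinations

population = [42, 39, 36, 33, 30]   # Table 7.2, in thousands
N, n = len(population), 3

# Population mean and variance (dividing by N, since this is the whole population).
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

# Mean and variance of the sample mean, taken over all equally likely samples of size n.
means = [Fraction(sum(s), n) for s in combinations(population, n)]
mu_xbar = sum(means) / len(means)
var_xbar = sum((m - mu_xbar) ** 2 for m in means) / len(means)

# Right-hand side of formula (7.3).
formula_7_3 = sigma2 / n * Fraction(N - n, N - 1)

print(mu, mu_xbar)             # formula (7.2): these agree
print(var_xbar, formula_7_3)   # formula (7.3): these agree
```

Changing `n` to 2 or 4 shows that the agreement is not special to samples of size 3.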


The standard error of the mean is found by taking the square root of both sides of formula (7.3), and
is given by


\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N-1}} \qquad (7.4) \]


The term \(\sqrt{\frac{N-n}{N-1}}\) is called the finite population correction factor. If the sample size n is less than 5%
of the population size, i.e., n < .05N, the finite population correction factor is very near one and is
omitted in formula (7.4). If n < .05N, the standard error of the mean is given by


\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad (7.5) \]


EXAMPLE 7.11 The mean cost per county in the United States to maintain county roads is $785 thousand per
year and the standard deviation is $55 thousand. Approximately 4% of the counties are randomly selected and
the mean cost for the sample is computed. The number of counties is 3,143 and the sample size is 125. The
standard error of the mean using formula (7.4) is:


\[ \sigma_{\bar{x}} = \frac{55}{\sqrt{125}} \times \sqrt{\frac{3{,}143 - 125}{3{,}143 - 1}} = 4.91935 \times .98 = 4.82 \text{ thousand} \]


The standard error of the mean using formula (7.5) is:


\[ \sigma_{\bar{x}} = \frac{55}{\sqrt{125}} = 4.92 \text{ thousand} \]


Ignoring the finite population correction factor in this case changes the standard error by only a small amount.
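The two calculations in Example 7.11 differ only in whether the correction factor is applied, which a single function with an optional argument captures. This is a minimal sketch; the function name and the convention that `N=None` means "omit the correction" are assumptions made for illustration.

```python
import math

def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    With N given, formula (7.4) is used (finite population correction applied);
    with N omitted, formula (7.5) is used.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

with_fpc = standard_error_mean(55, 125, N=3143)   # formula (7.4)
without_fpc = standard_error_mean(55, 125)        # formula (7.5)

print(round(with_fpc, 2), round(without_fpc, 2))
```

Trying a larger sampling fraction, such as n = 1,500 of the 3,143 counties, shows how much more the correction matters once n is well above 5% of N.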
SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN
AND THE CENTRAL LIMIT THEOREM


If samples are selected from a population which is normally distributed with mean \(\mu\) and
standard deviation \(\sigma\), then the distribution of sample means is normally distributed, the mean of
this distribution is \(\mu_{\bar{x}} = \mu\), and the standard deviation of this distribution is
\(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\). The shape of the distribution of the sample means is normal or bell-shaped
regardless of the sample size.

The central limit theorem states that when sampling from a large population of any distribution
shape, the sample means have an approximately normal distribution whenever the sample size is 30 or more.
Furthermore, the mean of the distribution of sample means is \(\mu_{\bar{x}} = \mu\), and the standard deviation of
this distribution is \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\). It is important to note that when sampling from a nonnormal
distribution, \(\bar{x}\) may be treated as normally distributed only if the sample size is 30 or more. The central
limit theorem is illustrated graphically in Figure 7-1.


Fig. 7-1 (bell-shaped curve of \(f(\bar{x})\) centered at \(\mu\))


Figure 7-1 illustrates that for sample sizes greater than or equal to 30, \(\bar{x}\) has a distribution that is bell-
shaped and centers at \(\mu\). The spread of the curve is determined by \(\sigma_{\bar{x}}\).


EXAMPLE 7.12 If a large number of samples each of size n, where n is 30 or more, are selected and the 
means of the samples are calculated, then a histogram of the means will be bell-shaped regardless of the shape 
of the population distribution from which the samples are selected. However, if the sample size is less than 30, 
the histogram of the sample means may not be bell-shaped unless the samples are selected from a bell-shaped 
distribution. 
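The claim in Example 7.12 can be explored by simulation. The sketch below is an illustration under stated assumptions: the population is taken to be exponential with mean 1 (a strongly skewed shape chosen here for contrast, not one used in the text), and the function names and the choice of 5,000 repetitions are likewise arbitrary.

```python
import random
import statistics

def simulate_sample_means(draw, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n and return the list of sample means.

    `draw` is a function that takes a random.Random instance and returns one
    observation from the population.
    """
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]

def draw_exponential(rng):
    # ASSUMPTION: exponential population with mean 1 and standard deviation 1.
    return rng.expovariate(1.0)

means = simulate_sample_means(draw_exponential, n=30, repetitions=5000)

# The central limit theorem suggests a center near 1 and a spread near 1/sqrt(30).
print(statistics.fmean(means), statistics.stdev(means), 1 / 30 ** 0.5)
```

A histogram of `means` (with any plotting tool) should look roughly bell-shaped, whereas repeating the run with `n=2` should show clear right skew.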


APPLICATIONS OF THE SAMPLING DISTRIBUTION 
OF THE SAMPLE MEAN 


The distribution properties of the sample mean are used to evaluate the results of sampling, and 
form the underpinnings of many of the statistical inference techniques found in the remaining 
chapters of this text. The examples in this section illustrate the usefulness of the central limit 
theorem. 


EXAMPLE 7.13 A government report states that the mean amount spent per capita for police protection for
cities exceeding 150,000 in population is $500 and the standard deviation is $75. A criminal justice research
study found that for 40 such randomly selected cities, the average amount spent per capita for this sample for
police protection is $465. If the government report is correct, the probability of finding a sample mean that is
$35 or more below the national average is given by \(P(\bar{x} < 465)\). The central limit theorem assures us that
because the sample size exceeds 30, the sample mean has a normal distribution. Furthermore, if the government
report is correct, the mean of \(\bar{x}\) is $500 and the standard error is \(\frac{75}{\sqrt{40}} = \$11.86\). The area to the left
of $465 under the normal curve for \(\bar{x}\) is the same as the area to the left of
\(z = \frac{465 - 500}{11.86} = -2.95\). We have the following equality: \(P(\bar{x} < 465) = P(z < -2.95) = .0016\). This
result suggests that either we have a highly unusual sample or the government claim is incorrect. Figures 7-2
and 7-3 illustrate the solution graphically.


Fig. 7-2 (curve of \(f(\bar{x})\))          Fig. 7-3 (curve of \(f(z)\), with -2.95 and 0 marked)


In Example 7.13, the value 465 was transformed to a z value by subtracting the population mean 500 from 465,
and then dividing by the standard error 11.86. The equation for transforming a sample mean to a z value is
shown in formula (7.6):


\[ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \qquad (7.6) \]
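The calculation in Example 7.13 can be reproduced without a normal table by using `statistics.NormalDist` from the Python standard library. This is a sketch for checking the arithmetic; the function name is an assumption introduced here.

```python
import math
from statistics import NormalDist

def z_for_sample_mean(xbar, mu, sigma, n):
    """Transform a sample mean to a z value: subtract the population mean and
    divide by the standard error sigma / sqrt(n)."""
    return (xbar - mu) / (sigma / math.sqrt(n))

# Example 7.13: mu = 500, sigma = 75, n = 40, observed sample mean 465.
z = z_for_sample_mean(465, 500, 75, 40)
p_below = NormalDist().cdf(z)   # area to the left of z under the standard normal curve

print(round(z, 2), round(p_below, 4))
```

The software result may differ from the table-based answer in the last digit because the table lookup rounds z to two decimal places first.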


EXAMPLE 7.14 A machine fills containers of coffee labeled as 113 grams. Because of machine variability,
the amount per container is normally distributed with \(\mu = 113\) and \(\sigma = 1\). Each day, 4 of the containers are
selected randomly and the mean amount in the 4 containers is determined. If the mean amount in the four
containers is either less than 112 grams or greater than 114 grams, the machine is stopped and adjusted. Since
the distribution of fills is normally distributed, the sample mean is normally distributed even for a sample as
small as four. The mean of the sample mean is \(\mu_{\bar{x}} = 113\) and the standard error is \(\sigma_{\bar{x}} = .5\). The machine is
adjusted if \(\bar{x} < 112\) or if \(\bar{x} > 114\). The probability the machine is adjusted is equal to the sum
\(P(\bar{x} < 112) + P(\bar{x} > 114)\), since we add probabilities of mutually exclusive events connected by the word or. To
evaluate these probabilities, we use formula (7.6) to express the events involving \(\bar{x}\) in terms of z as follows:


\[ P(\bar{x} < 112) = P\left(\frac{\bar{x} - 113}{.5} < \frac{112 - 113}{.5}\right) = P(z < -2.00) = .0228 \]
\[ P(\bar{x} > 114) = P\left(\frac{\bar{x} - 113}{.5} > \frac{114 - 113}{.5}\right) = P(z > 2.00) = .0228 \]


The probability that the machine is adjusted is 2 x .0228 = .0456. It is seen that if this sampling technique is
used to monitor this process, there is a 4.56% chance that the machine will be adjusted even though it is
maintaining an average fill equal to 113 grams.
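The false-alarm probability in Example 7.14 can be computed directly from the distribution of the sample mean. The sketch below is illustrative; the variable names are assumptions, and the result agrees with the table-based .0456 only up to rounding in the last digit.

```python
import math
from statistics import NormalDist

mu, sigma, n = 113, 1, 4

# Distribution of the mean of 4 fills: normal, centered at mu, spread sigma / sqrt(n).
xbar_dist = NormalDist(mu, sigma / math.sqrt(n))

# The machine is adjusted when the sample mean falls below 112 or above 114.
p_adjust = xbar_dist.cdf(112) + (1 - xbar_dist.cdf(114))

print(round(p_adjust, 4))
```

Widening the limits to 111.5 and 114.5 (three standard errors) would make unnecessary adjustments much rarer, at the cost of reacting more slowly to a real shift.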
SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION


A population proportion is the proportion or percentage of a population that has a specified
characteristic or attribute. The population proportion is represented by the letter p. The sample
proportion is the proportion or percentage in the sample that has a specified characteristic or
attribute. The sample proportion is represented by either the symbol \(\hat{p}\) or \(\bar{p}\). We shall use the symbol
\(\bar{p}\) to represent the sample proportion in this text.


EXAMPLE 7.15 The nation's work force at a given time is 133,018,000 and the number of unemployed is
7,355,000. The proportion unemployed is \(p = \frac{7{,}355{,}000}{133{,}018{,}000} = .055\) and the jobless rate is 5.5%. A
sample of size 65,000 is selected from the nation's work force and 3,900 are unemployed in the sample. The
sample proportion unemployed is \(\bar{p} = \frac{3{,}900}{65{,}000} = .06\) and the sample jobless rate is 6%.


The population proportion p is a parameter measured on the complete population and is constant
over some time interval. The sample proportion \(\bar{p}\) is a statistic measured on a sample and is
considered to be a random variable whose value is dependent on the sample chosen. The set of all
possible values of a sample proportion along with the probabilities corresponding to those values is
called the sampling distribution of the sample proportion.


EXAMPLE 7.16 According to recent data, the nation's five most popular theme parks are shown in Table 7.8.
The table gives the name of the theme park and indicates whether or not the attendance exceeds 10 million per
year.


Table 7.8

Theme park                                   Attendance exceeds 10 million
A: Disneyland (Anaheim)
B: Magic Kingdom (Disney World)
C: Epcot (Disney World)
D: Disney/MGM Studios (Disney World)
E: Universal Studios Florida (Orlando)


For this population of size N = 5, the proportion of theme parks with attendance exceeding 10 million is
\(p = \frac{3}{5} = .60\) or 60%. There are 5 samples of size 4 possible when selected from this population. These
samples, the theme parks exceeding 10 million (yes or no), the sample proportion, and the probability associated
with the sample proportion, are given in Table 7.9.


Table 7.9

Sample         Probability
A, B, C, D     .2
A, B, C, E     .2
A, B, D, E     .2
A, C, D, E     .2
B, C, D, E     .2


For each sample, the sampling error, \(|\bar{p} - p|\), is either .10 or .15. From Table 7.9, it is seen that the probability
associated with sample proportion .75 is .2 + .2 = .4 and the probability associated with sample proportion .50 is
.2 + .2 + .2 = .6. The sampling distribution for \(\bar{p}\) is given in Table 7.10.
Table 7.10

\(\bar{p}\)       .50    .75
\(P(\bar{p})\)    .6     .4


For larger populations and samples, the sampling distribution of the sample proportion is more difficult to
construct, but the technique is the same.
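The same enumeration technique can be sketched in Python for Example 7.16. One assumption is needed and is flagged loudly: the text states only that three of the five parks exceed 10 million, so the particular yes/no assignment below is hypothetical. The resulting sampling distribution does not depend on which three parks are marked.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# ASSUMPTION: which three parks are True is hypothetical; only the count (3 of 5,
# giving p = .60) is taken from the example.
exceeds = {"A": True, "B": True, "C": True, "D": False, "E": False}

n = 4
samples = list(combinations(exceeds, n))   # all equally likely samples of size 4

# Sample proportion for each sample, tallied into a probability distribution.
counts = Counter(Fraction(sum(exceeds[park] for park in s), n) for s in samples)
dist = {p_bar: Fraction(c, len(samples)) for p_bar, c in sorted(counts.items())}

for p_bar, prob in dist.items():
    print(float(p_bar), float(prob))
```

Each sample of size 4 leaves out exactly one park, so the sample proportion is .50 when the omitted park exceeds 10 million and .75 otherwise.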


MEAN AND STANDARD DEVIATION OF THE SAMPLE PROPORTION 


Since the sample proportion has a distribution, it is a random variable and has a mean and a
standard deviation. The mean of the sample proportion is represented by the symbol \(\mu_{\bar{p}}\) and the
standard deviation of the sample proportion is represented by \(\sigma_{\bar{p}}\). The standard deviation of the
sample proportion is called the standard error of the proportion. Example 7.17 illustrates how to
find the mean of the sample proportion and the standard error of the proportion.


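EXAMPLE 7.17 For the sampling distribution of the sample proportion shown in Table 7.10, the mean is


\[ \mu_{\bar{p}} = \sum \bar{p} P(\bar{p}) = .5 \times .6 + .75 \times .4 = 0.6 \]
The variance of the sample proportion is
\[ \sigma_{\bar{p}}^2 = \sum \bar{p}^2 P(\bar{p}) - \mu_{\bar{p}}^2 = .25 \times .6 + .5625 \times .4 - .36 = 0.015 \]
The standard error of the proportion, \(\sigma_{\bar{p}}\), is equal to \(\sqrt{0.015} = 0.122\).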


The relationship between the mean of the sample proportion and the population proportion is
expressed by
\[ \mu_{\bar{p}} = p \qquad (7.7) \]


The standard error of the sample proportion is related to the population proportion, the population
size, and the sample size by
\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}} \qquad (7.8) \]


The term \(\sqrt{\frac{N-n}{N-1}}\) is called the finite population correction factor. If the sample size n is less than 5%
of the population size, i.e., n < .05N, the finite population correction factor is very near one and is
omitted in formula (7.8).
If n < .05N, the standard error of the proportion is given by formula (7.9), where q = 1 - p.


\[ \sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} \qquad (7.9) \]

EXAMPLE 7.18 In Example 7.16, dealing with the five most popular theme parks, it was shown that p = 0.6,
and in Example 7.17, it was shown that \(\mu_{\bar{p}} = 0.6\), illustrating formula (7.7). In Example 7.17 it was also shown
that \(\sigma_{\bar{p}} = \sqrt{0.015} = 0.122\). To illustrate formula (7.8), note that
\[ \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{.6 \times .4}{4}} \times \sqrt{\frac{5-4}{5-1}} = \sqrt{0.015} = 0.122 = \sigma_{\bar{p}} \]


EXAMPLE 7.19 Suppose 80% of all companies use e-mail. In a survey of 100 companies, the standard error
of the sample proportion using e-mail is \(\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{.8 \times .2}{100}} = .04\). The finite population
correction factor is not needed since there is a very large number of companies, and it is reasonable to assume
that n < .05N.
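Formulas (7.8) and (7.9) differ only by the correction factor, so one function with an optional population size covers both Example 7.18 and Example 7.19. This is a minimal sketch; the function name and the `N=None` convention are assumptions made for illustration.

```python
import math

def standard_error_proportion(p, n, N=None):
    """Standard error of the sample proportion.

    With N given, formula (7.8) is used (finite population correction applied);
    with N omitted, formula (7.9) is used.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

theme_parks = standard_error_proportion(0.6, 4, N=5)   # Example 7.18
email_survey = standard_error_proportion(0.8, 100)     # Example 7.19

print(round(theme_parks, 3), round(email_survey, 2))
```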


SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 
AND THE CENTRAL LIMIT THEOREM 


When the sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the
sample proportion is approximately normal with mean equal to p and standard error \(\sigma_{\bar{p}}\). This result is
sometimes referred to as the central limit theorem for the sample proportion. This result is illustrated
in Fig. 7-4.


EXAMPLE 7.20 Approximately 20% of the adults 25 and older have a bachelor's degree in the United States.
If a large number of samples of adults 25 and older, each of size 100, were taken across the country, then the
sample proportions having a bachelor's degree would vary from sample to sample. The distribution of sample
proportions would be normally distributed with a mean equal to 20% and a standard error equal to
\(\sqrt{\frac{20 \times 80}{100}} = 4\%\). According to the empirical rule, approximately 68% of the sample proportions would fall
between 16% and 24%, approximately 95% of the sample proportions would fall between 12% and 28%, and
approximately 99.7% would fall between 8% and 32%. The sample proportion distribution may be assumed to be
normally distributed since np = 20 and nq = 80 are both greater than 5.


Fig. 7-4 (bell-shaped curve of \(f(\bar{p})\) centered at p)


APPLICATIONS OF THE SAMPLING DISTRIBUTION 
OF THE SAMPLE PROPORTION 


The theory underlying the sample proportion is utilized in numerous statistical applications. The 
margin of error, control chart limits, and many other useful statistical techniques make use of the 
sampling distribution theory connected with the sample proportion. 
EXAMPLE 7.21 It is estimated that 42% of women and 36% of men ages 45 to 54 are overweight, that is,
20% over their desirable weight. The probability that one-half or more in a sample of 50 women ages 45 to 54
are overweight is expressed as \(P(\bar{p} \ge .5)\). The distribution of the sample proportion is normal since
np = 50 x .42 = 21 > 5 and nq = 50 x .58 = 29 > 5. The mean of the distribution is .42 and the standard error is
\(\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{.42 \times .58}{50}} = .07\). The probability, \(P(\bar{p} \ge .5)\), is shown as the shaded area in Fig. 7-5.


Fig. 7-5 (curve of \(f(\bar{p})\) with .42 and .5 marked; the area to the right of .5 is shaded)
To find the area under the normal curve shown in Fig. 7-5, it is necessary to transform the value .5 to a standard
normal value. The value for z is \(z = \frac{.5 - .42}{.07} = 1.14\). The area to the right of \(\bar{p} = .5\) is equal to the area to the
right of z = 1.14 and is given as follows.
\[ P(\bar{p} \ge .5) = P(z > 1.14) = .5 - .3729 = .1271 \]
In Example 7.21, the \(\bar{p}\) value equal to .5 was transformed to a z value by subtracting the mean
value of \(\bar{p}\) from .5 and then dividing the result by the standard error of the proportion. The equation
for transforming a sample proportion value to a z value is given by


\[ z = \frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}} \qquad (7.10) \]
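The steps of Example 7.21 (check np and nq, compute the standard error, convert to z, find the tail area) can be collected in one function. The sketch below is illustrative and its function name is an assumption. Because it does not round the standard error to .07 or z to 1.14 before the lookup, its answer differs slightly from the table-based value.

```python
import math
from statistics import NormalDist

def prob_sample_proportion_at_least(value, p, n):
    """Normal-approximation probability that the sample proportion is at least `value`.

    Raises ValueError when the rule of thumb np > 5 and nq > 5 is not met.
    """
    q = 1 - p
    if n * p <= 5 or n * q <= 5:
        raise ValueError("normal approximation not justified: need np > 5 and nq > 5")
    standard_error = math.sqrt(p * q / n)
    return 1 - NormalDist(mu=p, sigma=standard_error).cdf(value)

# Example 7.21: p = .42, n = 50, probability that the sample proportion is .5 or more.
prob = prob_sample_proportion_at_least(0.5, 0.42, 50)
print(round(prob, 3))
```

A left-tail probability such as \(P(\bar{p} \le .15)\) is one minus the value this function returns for .15, since the normal distribution is continuous.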


EXAMPLE 7.22 It is estimated that 1 out of 5 individuals over 65 have Parkinsonism, that is, signs of
Parkinson's disease. The probability that 15% or less in a sample of 100 such individuals have Parkinsonism is
represented as \(P(\bar{p} \le 15\%)\). Since np = 100 x .20 = 20 > 5 and nq = 100 x .80 = 80 > 5, it may be assumed that
\(\bar{p}\) has a normal distribution. The mean of \(\bar{p}\) is 20% and the standard error is
\(\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{20 \times 80}{100}} = 4\%\). Formula (7.10) is used to find an event involving z which is
equivalent to the event \(\bar{p} \le 15\). The event \(\bar{p} \le 15\) is equivalent to the event
\(z = \frac{\bar{p} - 20}{4} \le \frac{15 - 20}{4} = -1.25\), and therefore we have
\(P(\bar{p} \le 15\%) = P(z \le -1.25) = .5 - .3944 = .1056\).
Solved Problems


SIMPLE RANDOM SAMPLING 


7.1 The top nine medical doctor specialties in terms of median income are as follows: radiology, 
surgery, anesthesiology, obstetrics/gynecology, pathology, internal medicine, psychiatry, 
family practice, and pediatrics. How many simple random samples of size 4, chosen without 
replacement, are possible when selected from the population consisting of these nine 
specialties? What is the probability associated with each possible simple random sample of 
size 4 from the population consisting of these nine specialties? 


Ans: The number of possible samples is equal to \(\binom{9}{4} = \frac{9!}{4!\,5!} = 126\). Each possible sample has
probability \(\frac{1}{126} = .00794\) of being selected.


7.2 In problem 7.1, how many of the 126 possible samples of size 4 would include the specialty 
pediatrics? 


Ans: The number of samples of size 4 which include the specialty pediatrics would equal the number of
ways the other three specialties in the sample could be selected from the other eight specialties in
the population. The number of ways to select 3 from 8 is \(\binom{8}{3} = \frac{8!}{3!\,5!} = 56\). One such sample is the
sample consisting of the following: (radiology, surgery, pathology, pediatrics). There are 55 other
such samples containing the specialty pediatrics.
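The counts in problems 7.1 and 7.2 can be confirmed both by formula and by listing the samples. This is an illustrative Python check, not part of the original solutions.

```python
import math
from itertools import combinations

specialties = [
    "radiology", "surgery", "anesthesiology", "obstetrics/gynecology", "pathology",
    "internal medicine", "psychiatry", "family practice", "pediatrics",
]

# Problem 7.1: number of samples of size 4 and the probability of each.
total = math.comb(len(specialties), 4)
probability_each = 1 / total

# Problem 7.2: count the samples that contain pediatrics by listing them all.
with_pediatrics = sum(1 for s in combinations(specialties, 4) if "pediatrics" in s)

print(total, round(probability_each, 5), with_pediatrics)
```

The listed count matches \(\binom{8}{3}\) because fixing pediatrics leaves three places to fill from the other eight specialties.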


USING RANDOM NUMBER TABLES 


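Table 7.11

 1. Alabama                 18. Kentucky          35. North Dakota
 2. Alaska                  19. Louisiana         36. Ohio
 3. Arizona                 20. Maine             37. Oklahoma
 4. Arkansas                21. Maryland          38. Oregon
 5. California              22. Massachusetts     39. Pennsylvania
 6. Colorado                23. Michigan          40. Rhode Island
 7. Connecticut             24. Minnesota         41. South Carolina
 8. Delaware                25. Mississippi       42. South Dakota
 9. District of Columbia    26. Missouri          43. Tennessee
10. Florida                 27. Montana           44. Texas
11. Georgia                 28. Nebraska          45. Utah
12. Hawaii                  29. Nevada            46. Vermont
13. Idaho                   30. New Hampshire     47. Virginia
14. Illinois                31. New Jersey        48. Washington
15. Indiana                 32. New Mexico        49. West Virginia
16. Iowa                    33. New York          50. Wisconsin
17. Kansas                  34. North Carolina    51. Wyoming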


7.3 Use Table 7.1 of this chapter to obtain a sample of size 5 from the population consisting of the
50 states and the District of Columbia. Assume that the 51 members of the population are
listed in alphabetical order as shown in Table 7.11. Start with the two digits in columns 11 and 12,
and line 1 of Table 7.1. Read down these two columns until five unique states have been
selected. Give the numbers and the selected states.


Ans: The random numbers are 44, 12, 24, 10, and 07. From an alphabetical listing of the 50 states and
the District of Columbia in Table 7.11, the following sample is obtained: (Connecticut, Florida,
Hawaii, Minnesota, and Texas)


7.4 The random numbers in Table 7.1 are used to select 15 faculty members from an alphabetical
listing of the 425 faculty members at a university. The total faculty listing goes from 001:
Lawrence Allison to 425: Carol Ziebarth. A random start position is determined to be columns
31 through 33 and line 1. Read down these columns until you reach the end. Then go to
columns 36 through 38, line 1 and read down these columns until you reach the end. If
necessary, go to columns 41 through 43 and line 1 and read down these columns until you
obtain 15 unique numbers. Give the 15 numbers which determine the 15 selected faculty
members.


Ans: The numbers, in the order obtained from Table 7.1, are: 345, 254, 160, 105, 283, 026, 099, 385,
157, 393, 013, 126, 110, 004, and 279.


USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE 


7.5 In reference to problem 7.3, use Minitab to obtain a sample of 5 of the 50 states plus the
District of Columbia.


Ans: MTB > set c1
DATA > 1:51
DATA > end
MTB > sample 5 c1 c2
MTB > print c2


Data Display
C2
3 21 38 44 36


From the alphabetical listing of the states shown in Table 7.11, the following sample is obtained:
(Arizona, Maryland, Ohio, Oregon, and Texas)


Texas is the only state which is common to the samples obtained in problems 7.3 and 7.5.


7.6 In reference to problem 7.4, use Minitab to obtain a sample of 15 of the 425 faculty members.


Ans: MTB > set c1
DATA > 1:425
DATA > end
MTB > sample 15 c1 c2
MTB > sort c2 put into c3
MTB > print c3


Data Display
C3
58 73 111 112 187 228 239 283 285 319 322 325 364 384 394

The faculty member corresponding to number 283 is the only one that is common to the samples
obtained in problems 7.4 and 7.6.
SYSTEMATIC RANDOM SAMPLING


7.7 In reference to problem 7.3, choose a systematic sample of 5 of the 50 states plus the District
of Columbia.


Ans: Divide 51 by 5 to obtain 10.2. Round 10.2 down to 10 and select a number randomly between 1
and 10. Suppose the number 7 is obtained. The other 4 members of the sample are 17, 27, 37, and
47. From the alphabetical listing of the states given in Table 7.11, the following sample is selected:


(Connecticut, Kansas, Montana, Oklahoma, Virginia)


CLUSTER SAMPLING 


7.8 A large school district wishes to obtain a measure of the mathematical competency of the 
junior high students in the district. The district consists of 40 different junior high schools. 
Describe how you could use cluster sampling to obtain a measure of the mathematical 
competency of the junior high school students in the district. 


Ans: Randomly select a small number, say 5, of the 40 junior high schools. Then administer a test of
mathematical competency to all students in the 5 selected schools. The test results from the 5
schools constitute a cluster sample.


STRATIFIED SAMPLING 


7.9 Refer to problem 7.8. Explain how you could use stratified sampling to determine the 
mathematical competency of the junior high students in the district. 


Ans: Consider each of the 40 junior high schools as a stratum. Randomly select a sample from each
junior high proportional to the number of students in that junior high school and administer the
mathematical competency test to the selected students. Note that even though stratified sampling
could be used, cluster sampling as described in problem 7.8 would be easier to administer.


SAMPLING DISTRIBUTION OF THE SAMPLING MEAN 


7.10 The five cities with the most African-American-owned businesses were given in Table 7.2, which
is reproduced below.


Table 7.2

                         Number of African-American-owned
City                     businesses, in thousands

A: New York              42
B: Washington, D.C.      39
C: Los Angeles           36
D: Chicago               33
E: Atlanta               30

List all samples of size 4 and find the mean of each sample. Also, construct the sampling
distribution of the sample mean.

Ans: Table 7.12 lists the 5 samples of size 4, the sample values for each sample, and the
mean of each sample. Table 7.13 gives the sampling distribution of the sample mean.


Table 7.12

              Number of businesses
Samples       in the samples          Sample mean

A, B, C, D    42, 39, 36, 33          37.50
A, B, C, E    42, 39, 36, 30          36.75
A, B, D, E    42, 39, 33, 30          36.00
A, C, D, E    42, 36, 33, 30          35.25
B, C, D, E    39, 36, 33, 30          34.50


Table 7.13

$\bar{x}$       34.50   35.25   36.00   36.75   37.50
$P(\bar{x})$    .2      .2      .2      .2      .2
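
The listing of samples can be checked by brute force. The sketch below is an illustration only; it assumes the city values 42, 39, 36, 33, and 30 (in thousands) are attached to the letters A through E in that order, which is inferred from the sample values listed in Table 7.12.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Number of businesses in thousands; the letter-to-value mapping is an assumption.
businesses = {"A": 42, "B": 39, "C": 36, "D": 33, "E": 30}

# Every sample of size 4 drawn without replacement.
samples = list(combinations(businesses, 4))
sample_means = {s: sum(businesses[city] for city in s) / 4 for s in samples}

# Each sample is equally likely, so each distinct mean gets (count / number of samples).
counts = Counter(sample_means.values())
distribution = {m: Fraction(k, len(samples)) for m, k in sorted(counts.items())}

for mean, prob in distribution.items():
    print(f"{mean:6.2f}  {float(prob):.2f}")
```

Five samples produce five distinct means, so each mean has probability 1/5 = .2.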


7.11 Consider sampling from an infinite population. Suppose the number of firearms of any type per 
household in this population is distributed uniformly with 25% having no firearms, 25% 
having exactly one, 25% having exactly two, and 25% having exactly three. If X represents the 
number of firearms per household, then X has the distribution shown in Table 7.14. 


Table 7.14

$x$       0     1     2     3
$P(x)$    .25   .25   .25   .25


Give the sampling distribution of the sample mean for all samples of size two from this 
population. 


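Table 7.15

Possible sample values
$x_1$ and $x_2$           Sample mean    Probability

0, 0                      0.0            .0625
0, 1                      0.5            .0625
0, 2                      1.0            .0625
0, 3                      1.5            .0625
1, 0                      0.5            .0625
1, 1                      1.0            .0625
1, 2                      1.5            .0625
1, 3                      2.0            .0625
2, 0                      1.0            .0625
2, 1                      1.5            .0625
2, 2                      2.0            .0625
2, 3                      2.5            .0625
3, 0                      1.5            .0625
3, 1                      2.0            .0625
3, 2                      2.5            .0625
3, 3                      3.0            .0625
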


Ans: Suppose $x_1$ and $x_2$ represent the values for the two households sampled. Table 7.15
gives the possible values for $x_1$ and $x_2$, the mean of the sample values, and the probability
associated with each pair of sample values. Since there are 16 equally likely sample pairs, each
one has probability 1/16 = .0625. The possible values for the sample mean are 0, 0.5, 1.0, 1.5, 2.0,
2.5, and 3.0. The probability associated with each sample mean value can be determined from
Table 7.15. For example, the sample mean equals 1.5 for four different pairs and the probability
for the value 1.5 is obtained by adding the probabilities associated with (0, 3), (1, 2), (2, 1), and
(3, 0). That is, $P(\bar{x} = 1.5)$ = .0625 + .0625 + .0625 + .0625 = .25. The sampling distribution is
shown in Table 7.16.


Table 7.16

$\bar{x}$       0       0.5     1.0     1.5     2.0     2.5     3.0
$P(\bar{x})$    .0625   .1250   .1875   .2500   .1875   .1250   .0625
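
The same distribution can be obtained by enumerating the 16 ordered pairs. The following is a minimal sketch with variable names chosen here for illustration; exact fractions are used so that no rounding enters the probabilities.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population distribution of X, the number of firearms per household (Table 7.14).
population = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4)}

# Sampling from an infinite population: the two draws are independent,
# so the probability of a pair is the product of the individual probabilities.
sampling_dist = defaultdict(Fraction)
for (x1, p1), (x2, p2) in product(population.items(), repeat=2):
    sampling_dist[Fraction(x1 + x2, 2)] += p1 * p2

for mean in sorted(sampling_dist):
    print(float(mean), float(sampling_dist[mean]))
```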


SAMPLING ERROR 


7.12 Give the sampling error associated with each of the samples in problem 7.10. 


Ans: The mean number of African-American-owned businesses for the five cities is 36,000. There are
five different possible samples as shown in Table 7.17. Each of the sample means shown in Table
7.17 is an estimate of the population mean. The sampling error, in thousands, is $|\bar{x} - 36|$, and is shown for each
sample in the last column in Table 7.17.


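Table 7.17

              Number of businesses
Samples       in the samples          Sample mean    Sampling error

A, B, C, D    42, 39, 36, 33          37.50          1.50
A, B, C, E    42, 39, 36, 30          36.75          0.75
A, B, D, E    42, 39, 33, 30          36.00          0
A, C, D, E    42, 36, 33, 30          35.25          0.75
B, C, D, E    39, 36, 33, 30          34.50          1.50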


7.13 Give the minimum and the maximum sampling error encountered in the sampling described in 
problem 7.11 and the probability associated with the minimum and maximum sampling error. 


Ans: The population described in problem 7.11 has a uniform distribution and the mean is as follows:
$\mu = 0 \times .25 + 1 \times .25 + 2 \times .25 + 3 \times .25 = 1.5$. The minimum error is 0 and occurs when $\bar{x} = 1.5$.
The probability of a sampling error equal to 0 is equal to the probability that the sample mean is
equal to 1.5, which is .25 as shown in Table 7.16. The maximum sampling error is 1.5 and occurs
when $\bar{x} = 0$ or $\bar{x} = 3.0$. The probability associated with the maximum sampling error is .0625 +
.0625 = .125.


MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN 


7.14 Find the mean and variance of the sampling distribution of the sample mean derived in 
problem 7.10 and given in Table 7.13. Verify that formulas (7.2) and (7.3) hold for this 
problem. 

Ans: Table 7.13 is reprinted below for ease of reference. The mean of the sample mean is

$\mu_{\bar{x}} = \sum \bar{x} P(\bar{x}) = 34.50 \times .2 + 35.25 \times .2 + 36.00 \times .2 + 36.75 \times .2 + 37.50 \times .2 = 36$

The variance of the sample mean is $\sigma_{\bar{x}}^2 = \sum \bar{x}^2 P(\bar{x}) - \mu_{\bar{x}}^2$.

$\sum \bar{x}^2 P(\bar{x}) = 1190.25 \times .2 + 1242.5625 \times .2 + 1296 \times .2 + 1350.5625 \times .2 + 1406.25 \times .2 = 1297.125$

and $\mu_{\bar{x}}^2 = 1296$, and therefore $\sigma_{\bar{x}}^2 = 1297.125 - 1296 = 1.125$.


Table 7.13

$\bar{x}$       34.50   35.25   36.00   36.75   37.50
$P(\bar{x})$    .2      .2      .2      .2      .2


In Example 7.7 it is shown that $\mu = 36$ and $\sigma^2 = 18$. Formula (7.2) is verified since $\mu = \mu_{\bar{x}} = 36$.

To verify formula (7.3), note that $\frac{\sigma^2}{n} \times \frac{N - n}{N - 1} = \frac{18}{4} \times \frac{5 - 4}{5 - 1} = 1.125 = \sigma_{\bar{x}}^2$.
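
The verification can also be carried out numerically. The sketch below is an illustration under the assumption that the population consists of the five values 42, 39, 36, 33, and 30 (thousands of businesses); it compares the variance of all possible sample means with the right-hand side of formula (7.3).

```python
from itertools import combinations
from statistics import mean, pvariance

population = [42, 39, 36, 33, 30]  # thousands of businesses, assumed from Table 7.12
N, n = len(population), 4

# Means of every possible sample of size n drawn without replacement.
sample_means = [sum(sample) / n for sample in combinations(population, n)]

mu = mean(population)             # population mean
sigma2 = pvariance(population)    # population variance (divides by N)
mu_xbar = mean(sample_means)      # mean of the sample mean
var_xbar = pvariance(sample_means)  # variance of the sample mean

# Right-hand side of formula (7.3), including the finite population correction.
fpc_variance = (sigma2 / n) * (N - n) / (N - 1)

print(mu, mu_xbar, var_xbar, fpc_variance)
```

Both sides come out to 1.125, in agreement with the hand computation.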


7.15 Find the mean and variance of the sampling distribution of the sample mean derived in 
problem 7.11 and given in Table 7.16. Verify that formulas (7.2) and (7.3) hold for this 
problem. 


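Ans: Table 7.16 is reprinted below for ease of reference. The mean of the sample mean is

$\mu_{\bar{x}} = \sum \bar{x} P(\bar{x}) = 0 \times .0625 + 0.5 \times .125 + 1 \times .1875 + 1.5 \times .25 + 2 \times .1875 + 2.5 \times .125 + 3 \times .0625 = 1.5$

The variance of the sample mean is $\sigma_{\bar{x}}^2 = \sum \bar{x}^2 P(\bar{x}) - \mu_{\bar{x}}^2$.

$\sum \bar{x}^2 P(\bar{x}) = 0 \times .0625 + .25 \times .125 + 1 \times .1875 + 2.25 \times .25 + 4 \times .1875 + 6.25 \times .125 + 9 \times .0625 = 2.875$

and $\mu_{\bar{x}}^2 = 2.25$, and therefore $\sigma_{\bar{x}}^2 = 2.875 - 2.25 = 0.625$.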


Table 7.16

$\bar{x}$       0       0.5     1.0     1.5     2.0     2.5     3.0
$P(\bar{x})$    .0625   .1250   .1875   .2500   .1875   .1250   .0625


In order to verify formulas (7.2) and (7.3), we need to find the values for $\mu$ and $\sigma^2$ in problem
7.11. The population distribution was given in Table 7.14 and is shown below.


Table 7.14

$x$       0     1     2     3
$P(x)$    .25   .25   .25   .25


The population mean is $\mu = \sum x P(x) = 0 \times .25 + 1 \times .25 + 2 \times .25 + 3 \times .25 = 1.5$, and the variance
is given by $\sigma^2 = \sum x^2 P(x) - \mu^2 = 0 \times .25 + 1 \times .25 + 4 \times .25 + 9 \times .25 - 2.25 = 1.25$.

Formula (7.2) is verified, since $\mu = \mu_{\bar{x}} = 1.5$. To verify formula (7.3), note that since the
population is infinite, the finite population correction factor is not needed and
$\frac{\sigma^2}{n} = \frac{1.25}{2} = 0.625 = \sigma_{\bar{x}}^2$. Anytime the population is infinite, the finite population
correction factor is omitted and formula (7.3) simplifies to $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$.


SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN 
AND THE CENTRAL LIMIT THEOREM 


7.16 The portfolios of wealthy people over the age of 50 produce yearly retirement incomes which
are normally distributed with mean equal to $125,000 and standard deviation equal to $25,000.
Describe the distribution of the means of samples of size 16 from this population.

Ans: Since the population of yearly retirement incomes is normally distributed, the distribution of
sample means will be normally distributed for any sample size. The distribution of sample means
has mean equal to 125,000 dollars and standard error $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 25{,}000 / \sqrt{16} = 6{,}250$ dollars.


7.17 The mean age of nonresidential buildings is 30 years and the standard deviation of the ages of
nonresidential buildings is 5 years. The distribution of the ages is not normally distributed.
Describe the distribution of the means of samples from the distribution of ages of
nonresidential buildings for sample sizes n = 10 and n = 50.


Ans: The distribution type for the sample mean is unknown for small samples, i.e., n < 30. Therefore we
cannot say anything about the distribution of $\bar{x}$ when n = 10. For large samples, i.e., $n \ge 30$, the
central limit theorem assures us that $\bar{x}$ has a normal distribution. Therefore, for n = 50, the sample
mean has a normal distribution with mean 30 years and standard error $\sigma_{\bar{x}} = 5 / \sqrt{50} = 0.707$
years.


APPLICATIONS OF THE SAMPLING DISTRIBUTION OF 
THE SAMPLE MEAN 


7.18 In problem 7.16, find the probability of selecting a sample of 16 wealthy individuals whose
portfolios produce a mean retirement income exceeding $135,000.


Ans: We are asked to determine the probability that $\bar{x}$ exceeds 135,000 dollars. From problem 7.16, we know
that $\bar{x}$ has a normal distribution with mean 125,000 dollars and standard error 6,250 dollars. The event
$\bar{x} > 135{,}000$ is equivalent to the event

$z = \frac{\bar{x} - 125{,}000}{6{,}250} > \frac{135{,}000 - 125{,}000}{6{,}250} = 1.60$

Since the event $\bar{x} > 135{,}000$ is equivalent to the event $z > 1.60$, the two events have equal probabilities.
That is, $P(\bar{x} > 135{,}000) = P(z > 1.60)$. Using the standard normal distribution table, the
probability $P(z > 1.60)$ is equal to .5 - .4452 = .0548. That is, $P(\bar{x} > 135{,}000) = P(z > 1.60)$ =
.0548.
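
The table lookup can be checked with Python's standard library. This is a sketch only; the variable names are chosen here, and `statistics.NormalDist` stands in for the standard normal distribution table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 125_000, 25_000, 16

# Standard error of the mean: sigma / sqrt(n).
standard_error = sigma / sqrt(n)

# Sampling distribution of the sample mean for a normal population.
xbar_dist = NormalDist(mu, standard_error)

z = (135_000 - mu) / standard_error
p_exceeds = 1 - xbar_dist.cdf(135_000)

print(standard_error, z, round(p_exceeds, 4))
```

The result agrees with the table value .0548 to four decimal places.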


7.19 In problem 7.17, determine the probability that a random sample of 50 nonresidential buildings
will have a mean age of 27.5 years or less.


Ans: We are asked to determine $P(\bar{x} < 27.5)$. From problem 7.17, we know that $\bar{x}$ has a normal
distribution with mean 30 years and standard error 0.707 years. Since the event $\bar{x} < 27.5$ is
equivalent to the event

$z = \frac{\bar{x} - 30}{.707} < \frac{27.5 - 30}{.707} = -3.54$

we have $P(\bar{x} < 27.5) = P(z < -3.54)$. From the standard normal distribution table, the probability is less than .001.


SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 


7.20 Approximately 8.8 million of the 12.8 million individuals receiving Aid to Families with
Dependent Children are 18 or younger. What proportion of the individuals receiving such aid
are 18 or younger?

Ans: The population consists of all individuals receiving Aid to Families with Dependent Children. The
proportion having the characteristic that an individual receiving such aid is 18 or younger is
represented by p and is equal to $\frac{8.8}{12.8} = 0.69$.


7.21 Table 7.18 describes a population consisting of five states and indicates whether or not there is
at least one woman on death row for that state.


Table 7.18

State            At least one woman on death row

A: Alabama       yes
B: California    yes
C: Colorado      no
D: Kentucky      no
E: Nebraska      no

For this population 40% of the states have at least one woman on death row, that is, p = 0.4. 
List all samples of size 2 from this population and find the proportion having at least one 
woman on death row for each sample. Use this listing to derive the sampling distribution for 
the sample proportion. 


Ans: Table 7.19 lists all possible samples of size 2, indicates whether or not the states in the sample
have at least one woman on death row, gives the sample proportion for each sample, and gives the
probability for each sample.


Table 7.19

          At least one woman    Sample
Sample    on death row          proportion    Probability

A, B      y, y                  1             .1
A, C      y, n                  .5            .1
A, D      y, n                  .5            .1
A, E      y, n                  .5            .1
B, C      y, n                  .5            .1
B, D      y, n                  .5            .1
B, E      y, n                  .5            .1
C, D      n, n                  0             .1
C, E      n, n                  0             .1
D, E      n, n                  0             .1


From Table 7.19, it is seen that $\bar{p}$ takes on the values 0, .5, and 1 with probabilities .3, .6, and .1,
respectively. The probability that $\bar{p} = 0$ is obtained by adding the probabilities for the samples (C,
D), (C, E), and (D, E) since $\bar{p} = 0$ if any one of these samples is selected. The other two
probabilities are obtained similarly. Table 7.20 gives the sampling distribution for $\bar{p}$.


Table 7.20

$\bar{p}$       0     .5    1
$P(\bar{p})$    .3    .6    .1
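
The sampling distribution of the sample proportion can be derived by enumeration in the same way as for the sample mean. The following sketch is illustrative; the dictionary `has_woman` encodes the yes/no column of the five-state population and its name is an assumption made here.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# True means the state has at least one woman on death row.
has_woman = {"A": True, "B": True, "C": False, "D": False, "E": False}

# All samples of size 2 drawn without replacement; each is equally likely.
samples = list(combinations(has_woman, 2))

# Sample proportion for each sample: (number of "yes" states) / 2.
counts = Counter(Fraction(sum(has_woman[s] for s in pair), 2) for pair in samples)
p_bar_dist = {p: Fraction(k, len(samples)) for p, k in sorted(counts.items())}

for p, prob in p_bar_dist.items():
    print(float(p), float(prob))
```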
MEAN AND STANDARD DEVIATION OF THE SAMPLE PROPORTION 


7.22 Find the mean and variance of the sample proportion in problem 7.21 and verify that formula
(7.7) and formula (7.8) are satisfied.

Ans: The mean of the sample proportion is $\mu_{\bar{p}} = \sum \bar{p} P(\bar{p}) = 0 \times .3 + .5 \times .6 + 1 \times .1 = .4$, and the
variance of the sample proportion is $\sigma_{\bar{p}}^2 = \sum \bar{p}^2 P(\bar{p}) - \mu_{\bar{p}}^2 = 0 \times .3 + .25 \times .6 + 1 \times .1 - (.4)^2 = .09$.
Formula (7.7) is satisfied since $\mu_{\bar{p}} = p = .4$. To verify formula (7.8), we need to show that

$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \times \sqrt{\frac{N - n}{N - 1}}$

Since the variance of the sample proportion is .09, the standard
deviation is the square root of .09 or .3. Therefore, to verify formula (7.8), all we need do is show
that the right-hand side of the equation equals .3.

$\sqrt{\frac{p(1 - p)}{n}} \times \sqrt{\frac{N - n}{N - 1}} = \sqrt{\frac{.4 \times .6}{2}} \times \sqrt{\frac{5 - 2}{5 - 1}} = .3$


7.23 Suppose 5% of all adults in America have 10 or more credit cards. Find the standard error of
the sample proportion having 10 or more credit cards in a sample of 1,000 American adults.


Ans: Since the sample size is less than 5% of the population size, the standard error is
$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{.05 \times .95}{1{,}000}} = 0.0069$.
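
Both standard-error computations, with and without the finite population correction, can be sketched in one small function. The function name and its optional `population_size` argument are assumptions made for this illustration; the formula itself is the one used in problems 7.22 and 7.23.

```python
from math import sqrt


def proportion_standard_error(p, n, population_size=None):
    """Standard error of the sample proportion.

    If population_size is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied; otherwise it is omitted.
    """
    se = sqrt(p * (1 - p) / n)
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se


se_small_population = proportion_standard_error(0.4, 2, population_size=5)  # problem 7.22
se_credit_cards = proportion_standard_error(0.05, 1000)                     # problem 7.23

print(round(se_small_population, 4), round(se_credit_cards, 4))
```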


SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 
AND THE CENTRAL LIMIT THEOREM 


7.24 A sample of size n is selected from a large population. The proportion in the population with a
specified characteristic is p. The proportion in the sample with the specified characteristic is
$\bar{p}$. In which of the following does $\bar{p}$ have a normal distribution?
(a) n = 20, p = .9   (b) n = 15, p = .4   (c) n = 100, p = .97   (d) n = 1000, p = .01


Ans: (a) np = 18, and nq = 2. Since np and nq do not both exceed 5, we are not sure that the
distribution of $\bar{p}$ is normal.
(b) np = 6, and nq = 9. Since both np and nq exceed 5, the distribution of $\bar{p}$ is normal.
(c) np = 97, and nq = 3. Since np and nq do not both exceed 5, we are not sure that the
distribution of $\bar{p}$ is normal.
(d) np = 10, and nq = 990. Since both np and nq exceed 5, the distribution of $\bar{p}$ is normal.
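
The rule applied in parts (a) through (d) is easy to express as a check. The function name `normal_approximation_ok` is hypothetical and the threshold of 5 is the one used in this chapter; other texts use different thresholds.

```python
def normal_approximation_ok(n, p, threshold=5):
    """Return True when both np and nq exceed the threshold, where q = 1 - p."""
    q = 1 - p
    return n * p > threshold and n * q > threshold


cases = {"a": (20, 0.9), "b": (15, 0.4), "c": (100, 0.97), "d": (1000, 0.01)}
results = {label: normal_approximation_ok(n, p) for label, (n, p) in cases.items()}
print(results)
```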


APPLICATIONS OF THE SAMPLING DISTRIBUTION 
OF THE SAMPLE PROPORTION 


7.25 Approximately 15% of the population is left-handed. What is the probability that in a sample 
of 50 randomly chosen individuals, 30% or more in the sample will be left-handed? That is, 
what is the probability of finding 15 or more left-handers in the 50? 


Ans: The sample proportion, $\bar{p}$, has a normal distribution since np = 50 × .15 = 7.5 > 5 and nq = 50 ×
.85 = 42.5 > 5. The mean of the sample proportion is 15% and the standard error is
$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{15 \times 85}{50}} = 5.05\%$. To find the probability that $\bar{p} \ge 30\%$, we first transform $\bar{p}$ to a standard
normal by the transformation $z = \frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}}$ as follows. The inequality $\bar{p} \ge 30$ is equivalent to

$z = \frac{\bar{p} - 15}{5.05} \ge \frac{30 - 15}{5.05} = 2.97$

Because $\bar{p} \ge 30$ is equivalent to $z \ge 2.97$ we have $P(\bar{p} \ge 30) = P(z \ge 2.97)$ = .5 - .4985 = .0015.
That is, 15 or more left-handers will be found in a sample of 50
individuals only about 0.2% of the time.
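
The computation can be reproduced without a table. This sketch uses proportions rather than percentages, and `statistics.NormalDist` plays the role of the standard normal table; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.15, 50

# Standard error of the sample proportion (large population, no correction factor).
standard_error = sqrt(p * (1 - p) / n)

# Standardize the cutoff of 30% and take the upper-tail area.
z = (0.30 - p) / standard_error
prob = 1 - NormalDist().cdf(z)

print(round(standard_error, 4), round(z, 2), round(prob, 4))
```

The unrounded z value gives an upper-tail area that agrees with the table-based answer .0015 to four decimal places.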


7.26 Suppose 70% of the population support the ban on assault weapons. What is the probability
that between 65% and 75% in a poll of 100 individuals will support the ban on assault
weapons?


Ans: We are asked to find $P(65 < \bar{p} < 75)$. The sample proportion $\bar{p}$ has a normal distribution with
mean 70% and standard error $\sigma_{\bar{p}} = \sqrt{\frac{70 \times 30}{100}} = 4.58\%$. The event $65 < \bar{p} < 75$ is
equivalent to the event

$\frac{65 - 70}{4.58} < \frac{\bar{p} - 70}{4.58} < \frac{75 - 70}{4.58}$

or $-1.09 < z < 1.09$, and since these events are
equivalent, they have equal probabilities. That is, $P(65 < \bar{p} < 75) = P(-1.09 < z < 1.09)$ = 2 ×
.3621 = .7242.
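
A two-sided probability of this kind is the difference of two cumulative probabilities. The sketch below works in proportions rather than percentages; the names are illustrative, and the small difference from .7242 comes from the table answer rounding z to 1.09.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.70, 100

# Sampling distribution of the sample proportion: normal with mean p.
standard_error = sqrt(p * (1 - p) / n)
p_bar_dist = NormalDist(p, standard_error)

# P(.65 < p-bar < .75) as a difference of cumulative probabilities.
prob_between = p_bar_dist.cdf(0.75) - p_bar_dist.cdf(0.65)

print(round(standard_error, 4), round(prob_between, 4))
```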


Supplementary Problems


SIMPLE RANDOM SAMPLING


7.27 Rather than have all 25 students in her Statistics class complete a teacher evaluation form, Mrs. Jones 
decides to randomly select three students and have the department chairman interview the three 
concerning her teaching after the course grade has been given. How many different samples of size 3 can 
be selected? 
Ans. 2,300 

7.28 USA Today lists the 1900 most active New York Stock Exchange issues. How many samples of size three
are possible when selected from these 1900 stock issues?

Ans. 1,141,362,300
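
Both counts are numbers of combinations and can be checked directly. This short sketch uses `math.comb`, which is available in Python 3.8 and later.

```python
from math import comb

# Number of different samples of size 3 from 25 students and from 1900 stock issues.
samples_from_class = comb(25, 3)
samples_from_stocks = comb(1900, 3)

print(f"{samples_from_class:,}")
print(f"{samples_from_stocks:,}")
```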


USING RANDOM NUMBER TABLES 


7.29 In a table of random numbers such as Table 7.1, what relative frequency would you expect for each of the
digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9?

Ans. 0.1

7.30 The 100 U.S. Senators are listed in alphabetical order and then numbered as 00, 01, ... , 99. Use Table
7.1 to select 10 of the senators. Start with the two digits in columns 31 and 32 and row 6. The first
selected senator is numbered 76. Reading down the two columns from the number 76, what are the
two-digit numbers of the other 9 selected senators?

Ans. 59, 78, 44, 93, 69, 25, 43, 50, and 45


USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE


7.31 Use Minitab to select 25 of the 1900 stock issues discussed in problem 7.28. Give the numbers of the
25 selected stock issues in ascending order. Assume the stock issues are numbered from 1 to 1900.


Ans. MTB > set c1
DATA > 1:1900
DATA > end
MTB > sample 25 from c1 put into c2
MTB > sort c2 put into c3
MTB > print c3


Data Display
C3

  26   219   238   288   328   423   766   785   943  1006  1187
1197  1234  1238  1257  1313  1532  1562  1613  1639  1728  1784
1788  1798  1870


7.32 Use Minitab to select 50 of the 3143 counties in the United States. Give the numbers of the 50
selected counties in ascending order. Assume the counties are in alphabetical order and are numbered from 1 to
3143.


Ans. MTB > set c1
DATA > 1:3143
DATA > end
MTB > sample 50 from c1 put into c2
MTB > sort c2 put into c3
MTB > print c3


Data Display
C3

  33    47    53   139   265   312   321   343   444   519   599
 652   658   764   789   829   949   964  1063  1134  1209  1300
1463  1567  1630  1694  1706  1818  1848  2021  2048  2138  2143
2270  2285  2306  2311  2408  2414  2463  2487  2513  2658  2701
2718  2763  2862  2892  2917  2975


SYSTEMATIC RANDOM SAMPLING


7.33 Describe how the state patrol might obtain a systematic random sample of the speeds of vehicles along a
stretch of Interstate 80.


Ans. Use a radar unit to measure the speeds of systematically chosen vehicles along the stretch of
Interstate 80. For example, measure the speed of every tenth vehicle.


CLUSTER SAMPLING 


7.34 A particular city is composed of 850 blocks and each block contains approximately 20 homes. Fifteen of
the 850 blocks are randomly selected and each household on the selected blocks is administered a survey
concerning issues of interest to the city council. How large is the population? How large is the sample?
What type of sampling is being used?


Ans. The population consists of 17,000 households. The sample consists of 300 households. Cluster
sampling is being used.


STRATIFIED SAMPLING


7.35 A drug store chain has stores located in 5 cities as follows: 20 stores in Los Angeles, 40 stores in New 
York, 20 stores in Seattle, 10 stores in Omaha, and 30 stores in Chicago. In order to estimate pharmacy 
sales, the following number of stores are randomly selected from the 5 cities: 4 from Los Angeles, 8 from 
New York, 4 from Seattle, 2 from Omaha, and 6 from Chicago. What type of sampling is being used? 


Ans. stratified sampling 


SAMPLING DISTRIBUTION OF THE SAMPLING MEAN 


7.36 The export values in billions of dollars for four American cities for a recent year are as follows: A:
Detroit, 28; B: New York, 24; C: Los Angeles, 22; and D: Seattle, 20. If all possible samples of size 2 are
selected from this population of four cities and the sample mean value of exports computed for each
sample, give the sampling distribution of the sample mean.


Ans. $P(\bar{x}) = \frac{1}{6}$, where $\bar{x}$ = 21, 22, 23, 24, 25, or 26.


7.37 Consider a large population of households. Ten percent of the households have no home computer, 60
percent of the households have exactly one home computer, and 30 percent of the households have
exactly two home computers. Construct the sampling distribution for the mean of all possible samples of
size 2.

Ans.
$\bar{x}$       0     .5    1     1.5   2
$P(\bar{x})$    .01   .12   .42   .36   .09


SAMPLING ERROR


7.38 In reference to problem 7.36, if the mean of a sample of two cities is used to estimate the mean export 
value of the four cities, what are the minimum and maximum values for the sampling error? 


Ans. minimum sampling error = .5; maximum sampling error = 2.5

7.39 In reference to problem 7.37, what are the possible sampling error values associated with samples of two
households used to estimate the mean number of home computers per household for the population?
What is the most likely value for the sampling error?


Ans. .2, .3, .7, .8, and 1.2. The most likely value is .2. The probability that the sampling error equals .2 is
.42.


MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN 


7.40 Find the mean and variance of the sampling distribution in problem 7.36. Verify that formulas (7.2) and 
(7.3) hold for this sampling distribution. 


Ans. $\mu = 23.5$, $\sigma^2 = 8.75$, $\mu_{\bar{x}} = 23.5$, $\sigma_{\bar{x}}^2 = 2.917$


7.41 Find the mean and variance of the sampling distribution in problem 7.37. Verify that formulas (7.2) and
(7.3) hold for this sampling distribution.

Ans. $\mu = 1.2$, $\sigma^2 = 0.36$, $\mu_{\bar{x}} = 1.2$, $\sigma_{\bar{x}}^2 = 0.18$

$\mu_{\bar{x}} = \mu$. Since the population is infinite, formula (7.3) becomes $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, and
$\frac{\sigma^2}{n} = \frac{.36}{2} = 0.18 = \sigma_{\bar{x}}^2$.
n Zz 


SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN 
AND THE CENTRAL LIMIT THEOREM 


7.42 For cities of 100,000 or more, the number of violent crimes per 1,000 residents is normally distributed 
with mean equal to 30 and standard deviation equal to 7. Describe the sampling distribution of means of 
samples of size 4 from such cities. 


Ans. The means of samples of size 4 will be normally distributed with mean equal to 30 and standard 
error equal to 3.5. 


7.43 For cities of 100,000 or more, the mean total crime rate per 1,000 residents is 95 and the standard 
deviation of the total crime rate per 1,000 is 15. The distribution of the total crime rate per 1,000 
residents for cities of 100,000 or more is not normally distributed. The distribution is skewed to the right.
Describe the sampling distribution of the means of samples of sizes 10 and 50 from such cities. 


Ans. For samples of size 50, the central limit theorem assures us that the distribution of sample means is
normally distributed with mean equal to 95 and standard error equal to 2.12. For samples of size
10 from a nonnormal distribution, the distribution form of the sample means is unknown.


APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN 


7.44 In problem 7.42, find the probability of selecting four cities of population 100,000 or more whose mean
number of violent crimes per 1,000 residents exceeds 40 violent crimes per 1,000 residents.


Ans. 0.0021

7.45 The mean number of bumped passengers per 10,000 passengers per day is 1.35 and the standard
deviation is 0.25. For a random selection of 40 days, what is the probability that the mean number of
bumped passengers per 10,000 passengers for the 40 days will be between 1.25 and 1.50?


Ans. 0.9938 


SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 


7.46 The world output of bicycles in 1995 was 114 million. China produced 41 million bicycles in 1995. What 
proportion of the world’s new bicycles in 1995 were produced by China? 


Ans. p = 0.36

7.47 A company has 38,000 employees and the proportion of the employees who have a college degree equals
0.25. In a sample of 400 of the employees, 30 percent have a college degree. How many of the company
employees have a college degree? How many in the sample have a college degree?


Ans. 9,500 of the company employees have a college degree and 120 in the sample have a degree.


MEAN AND STANDARD DEVIATION OF THE SAMPLE PROPORTION


7.48 One percent of the vitamin C tablets produced by an industrial process are broken. The process fills
containers with 1,000 tablets each. At regular intervals, a container is selected and 100 of the 1,000
tablets are inspected for broken tablets. What is the mean value and standard deviation of $\bar{p}$, where $\bar{p}$
represents the proportion of broken tablets in the samples of 100 selected tablets?


Ans. The mean value of $\bar{p}$ is .01 and the standard deviation of $\bar{p}$ is
$\sqrt{\frac{.01 \times .99}{100}} \times \sqrt{\frac{1{,}000 - 100}{1{,}000 - 1}} = .0094$.


7.49 A survey reported that 20% of pregnant women smoke, 19% drink, and 13% use crack cocaine or other
drugs. Assuming the survey results are correct, what is the mean and standard deviation of $\bar{p}$, where $\bar{p}$ is
the proportion of smokers in samples of 300 pregnant women?


Ans. The mean value of $\bar{p}$ is 20% and the standard deviation of $\bar{p}$ is $\sqrt{\frac{20 \times 80}{300}} = 2.31\%$.


SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 
AND THE CENTRAL LIMIT THEOREM 


7.51 For a sample of size 50, give the range of population proportion values for which $\bar{p}$ has an approximate
normal distribution.


Ans. .1 < p < .9


APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 


7.52 Thirty-five percent of the athletes who participated in the 1996 summer Olympics in Atlanta are female.
What is the probability of randomly selecting a sample of 50 of these athletes in which over half of those
selected are female?

Ans. $P(\bar{p} > .50) = P(z > 2.22)$ = .0132


7.53 It is estimated that 12% of Native Americans have diabetes. What is the probability of randomly selecting
100 Native Americans and finding 5 or fewer in the hundred who are diabetic?


Ans. $P(\bar{p} \le .05) = P(z < -2.15)$ = .0158


7.54 A pair of dice is tossed 180 times. What is the probability that the sum on the faces is equal to 7 on 20%
or more of the tosses?


Ans. $P(\bar{p} \ge .20)$ = .1170


Chapter 8 


Estimation and Sample Size Determination: 
One Population 


POINT ESTIMATE 


Estimation is the assignment of a numerical value to a population parameter or the construction 
of an interval of numerical values likely to contain a population parameter. The value of a sample 
statistic assigned to an unknown population parameter is called a point estimate of the parameter. 


EXAMPLE 8.1 The mean starting salary for 10 randomly selected new graduates with a Masters of Business 
Administration (MBA) at Fortune 500 companies is found to be $56,000. Fifty-six thousand dollars is a point 
estimate of the mean starting salary for all new MBA degree graduates at Fortune 500 companies. The median 
cost for 350 homes selected from across the United States is found to equal $115,000. The value of the sample 
median, $115,000, is a point estimate of the median cost of a home in the United States. A survey of 950 
households finds that 35% have a home computer. Thirty-five percent is a point estimate of the percentage of 
homes that have a home computer. 


INTERVAL ESTIMATE 


In addition to a point estimate, it is desirable to have some idea of the size of the sampling error,
that is, the difference between the population parameter and the point estimate. By utilizing the
standard error of the sample statistic and its sampling distribution, an interval estimate for the 
population parameter may be developed. A confidence interval is an interval estimate that consists of 
an interval of numbers obtained from the point estimate of the parameter along with a percentage that 
specifies how confident we are that the value of the parameter lies in the interval. The confidence 
percentage is called the confidence level. This chapter is concerned with the techniques for finding 
confidence intervals for population means and population proportions. 


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES 


According to the central limit theorem, the sample mean, $\bar{x}$, has a normal distribution provided
the sample size is 30 or more. Furthermore, the mean of the sample mean equals the mean of the
population and the standard error of the mean equals the population standard deviation divided by
the square root of the sample size. In Chapter 7, the variable given in formula (7.6) (and reproduced
below) was shown to have a standard normal distribution provided $n \ge 30$.

$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$        (7.6)

Since 95% of the area under the standard normal curve is between z = -1.96 and z = 1.96, and
since the variable in formula (7.6) has a standard normal distribution, we have the result shown in
formula (8.1).

$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} < 1.96\right) = .95$        (8.1)

The inequality, $-1.96 < \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} < 1.96$, is solved for $\mu$ and the result is given in formula (8.2).

$\bar{x} - 1.96\sigma_{\bar{x}} < \mu < \bar{x} + 1.96\sigma_{\bar{x}}$        (8.2)

The interval given in formula (8.2) is called a 95% confidence interval for the population mean, $\mu$.
The general form for the interval is shown in formula (8.3), where z represents the proper value from
the standard normal distribution table as determined by the desired confidence level.

$\bar{x} - z\sigma_{\bar{x}} < \mu < \bar{x} + z\sigma_{\bar{x}}$        (8.3)


EXAMPLE 8.2 The mean age of policyholders at Mutual Insurance Company is estimated by sampling the
records of 75 policyholders. The standard deviation of ages is known to equal 5.5 years and has not changed
over the years. However, it is unknown if the mean age has remained constant. The mean age for the sample of
75 policyholders is 30.5 years. The standard error of the mean is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{5.5}{\sqrt{75}} = .635$ years. In order to find
a 90% confidence interval for $\mu$, it is necessary to find the value of z in formula (8.3) for confidence level equal
to 90%. If we let c be the correct value for z, then we are looking for that value of c which satisfies the equation
P(-c < z < c) = .90. Or, because of the symmetry of the z curve, we are looking for that value of c which
satisfies the equation P(0 < z < c) = .45. From the standard normal distribution table, we find P(0 < z < 1.64) =
.4495 and P(0 < z < 1.65) = .4505. The interpolated value for c is 1.645, which we round to 1.65. Figure 8-1
illustrates the confidence level and the corresponding values of z. Now the 90% confidence interval is computed
as follows. The lower limit of the interval is $\bar{x} - 1.65\sigma_{\bar{x}}$ = 30.5 - 1.65 × .635 = 29.5 years, and the upper limit is
$\bar{x} + 1.65\sigma_{\bar{x}}$ = 31.5. We are 90% confident that the mean age of all 250,000 policyholders is between 29.5 and
31.5 years. It is important to note that $\mu$ either is or is not between 29.5 and 31.5 years. To say we are 90%
confident that the mean age of all policyholders is between 29.5 and 31.5 years means that if this study were
conducted a large number of times and a confidence interval were computed each time, then 90% of all the
possible confidence intervals would contain the true value of $\mu$.


[Fig. 8-1: standard normal curve f(z); the total shaded area is .90.]
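
The interval in Example 8.2 can be reproduced with a short function. This is a sketch under stated assumptions: the name `z_interval` is hypothetical, and the z value comes from `statistics.NormalDist().inv_cdf` (about 1.645 for 90% confidence) rather than from the rounded table value 1.65, so the limits agree with the example only after rounding.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for the population mean, following formula (8.3).

    sigma is the known population standard deviation; confidence is a
    fraction such as 0.90.
    """
    # z leaves (1 - confidence) / 2 in each tail of the standard normal curve.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


lower, upper = z_interval(30.5, 5.5, 75, 0.90)
print(round(lower, 1), round(upper, 1))
```

Changing the `confidence` argument to 0.95 or 0.99 gives the wider intervals that correspond to z = 1.96 and z = 2.58.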


Since it is time consuming to determine the correct value of z in formula (8.3), the values for the most often
used confidence levels are given in Table 8.1. They are found in the same manner as illustrated in Example 8.2.

Table 8.1


Confidence level 


EXAMPLE 8.3 The 911 response time for terrorist bomb threats was investigated. No historical data existed
concerning the standard deviation or mean for the response times. When $\sigma$ is unknown and the sample size is 30
or more, the standard deviation of the sample itself is used in place of $\sigma$ when constructing a confidence interval
for $\mu$. A sample of 35 response times was obtained and it was found that the sample mean was 8.5 minutes and
the standard deviation was 4.5 minutes. The estimated standard error of the mean is represented by $s_{\bar{x}}$ and is
equal to $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{4.5}{\sqrt{35}} = .76$. From Table 8.1, the value for z is 2.58. The lower limit for the 99%
confidence interval is $\bar{x} - 2.58 \times s_{\bar{x}}$ = 8.5 - 2.58 × .76 = 6.54 and the upper limit is 8.5 + 2.58 × .76 = 10.46.
We are 99% confident that the mean response time is between 6.54 minutes and 10.46 minutes. The true value
of $\mu$ may or may not be between these limits. However, 99% of all such intervals contain $\mu$. It is this fact that
gives us 99% confidence in the interval from 6.54 to 10.46.


For large samples, no assumption is made concerning the shape of the population distribution. If 
the population standard deviation is known, it is used in formula (8.3). If the population standard 
deviation is unknown, it is estimated by using the sample standard deviation. The value for z in 
formula (8.3) is found in the standard normal distribution table or, if applicable, by using Table 8.1. 


MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN 


The inequality in formula (8.3) may be expressed as shown:

$|\bar{x} - \mu| < z\sigma_{\bar{x}}$        (8.4)

The left-hand side of formula (8.4) is the sampling error when $\bar{x}$ is used as a point estimate of $\mu$.
The right-hand side of formula (8.4) is the maximum error of estimate or margin of error when $\bar{x}$ is
used as a point estimate of $\mu$. That is, when $\bar{x}$ is used as a point estimate of $\mu$, the maximum error of
estimate or margin of error, E, is

$E = z\sigma_{\bar{x}}$        (8.5)

When the confidence level is 95%, z = 1.96 and $E = 1.96\sigma_{\bar{x}}$. This value of E, $1.96\sigma_{\bar{x}}$, is called the 95%
margin of error or simply margin of error when $\bar{x}$ is used as a point estimate of $\mu$.


EXAMPLE 8.4 The annual college tuition costs for 40 community colleges selected from across the United
States are given in Table 8.2. The mean for these 40 sample values is $1,396, the standard deviation of the 40
values is $655, and the estimated standard error is $104. The mean annual tuition cost for all
community colleges in the United States is represented by μ. A point estimate of μ is given by $1,396. The
margin of error associated with this estimate is 1.96 × 104 = $204. The 95% confidence interval for μ, based
upon these data, goes from 1396 − 204 = $1,192 to 1396 + 204 = $1,600. It is worth noting that the margin of
error is actually ±$204, since the error may occur in either direction. Some publications give the margin of error
as E and some give it as ±E. We shall omit the ± sign when giving the margin of error.
To compute the confidence interval for μ using Minitab, set the data given in Table 8.2 into
column c1 and use the command standard deviation c1 to compute the sample standard deviation. The
command zinterval 95% confidence, sigma = 655.44, data in c1 uses the data in c1 to compute a 95%
confidence interval for μ. The output is shown below. The confidence interval extends from 1193 to 1599. The
interval computed in Example 8.4 extends from 1192 to 1600. The difference in the answers is due to rounding.


MTB > set c1
DATA > 1200 850 1700 1500 700 1200 1500 2000 1950 750 850
DATA > 3000 2100 500 500 1950 1000 950 560 500 1750 1650
DATA > 900 2050 1780 675 1080 680 900 1500 930 1640 1320
DATA > 1750 2500 2310 2900 1875 1450 950
DATA > end
MTB > standard deviation c1

Column Standard Deviation

Standard deviation of C1 = 655.44
MTB > name c1 'cost'
MTB > zinterval 95% confidence, sigma = 655.44, data in c1

Confidence Intervals
The assumed sigma = 655

Variable   N    Mean   StDev   SE Mean   95.0% C.I.
Cost       40   1396   655     104       (1193, 1599)
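
For readers without Minitab, the same interval can be approximated in Python. The sketch below is an illustration only: it reuses the 40 tuition values from the session above, takes sigma = 655.44 and z = 1.96 as given, and does not claim to reproduce Minitab's internal method.

```python
import math
from statistics import mean

# The 40 tuition costs entered in the Minitab session.
tuition = [
    1200, 850, 1700, 1500, 700, 1200, 1500, 2000, 1950, 750, 850,
    3000, 2100, 500, 500, 1950, 1000, 950, 560, 500, 1750, 1650,
    900, 2050, 1780, 675, 1080, 680, 900, 1500, 930, 1640, 1320,
    1750, 2500, 2310, 2900, 1875, 1450, 950,
]
sigma = 655.44   # value supplied to zinterval in the session
z = 1.96         # 95% confidence, from Table 8.1

x_bar = mean(tuition)
se = sigma / math.sqrt(len(tuition))          # standard error of the mean
lower, upper = x_bar - z * se, x_bar + z * se
print(round(x_bar), round(se), (round(lower), round(upper)))
```

Carrying the unrounded standard error through the calculation gives limits that round to 1193 and 1599, which is why they differ by a dollar from the hand computation that used the rounded margin of $204.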


The width of a confidence interval is equal to the upper limit of the interval minus the lower limit
of the interval. In Example 8.4, the width of the 95% confidence interval is 1599 − 1193 = $406.


THE t DISTRIBUTION 


When the sample size is less than 30 and the estimated standard error, \(s_{\bar{x}} = s/\sqrt{n}\), is used in place
of \(\sigma_{\bar{x}}\) in formula (8.3), the width of the confidence interval for μ will generally be incorrect. The t
distribution, also known as the Student-t distribution, is used rather than the standard normal
distribution to find confidence intervals for μ when the sample size is less than 30. In this section we
will discuss the properties of the t distribution, and in the next section, we will discuss the use of the
t distribution for confidence intervals when the sample size is small. The t distribution is used in
many different statistical applications.

The t distribution is actually a family of probability distributions. A parameter called the degrees
of freedom, and represented by df, determines each separate t distribution. The t distribution curves,
like the standard normal distribution curve, are centered at zero. However, the standard deviation of
each t distribution exceeds one and is dependent upon the value of df, whereas the standard deviation
of the standard normal distribution equals one. Figure 8-2 compares the standard normal curve with
the t distribution having 10 degrees of freedom. Generally speaking, the t distribution curves have a
lower maximum height and thicker tails than the standard normal curve, as shown in Fig. 8-2. All t
distribution curves are symmetrical about zero.


Fig. 8-2 


Appendix 3 gives right-hand tail areas under t distribution curves for degrees of freedom varying
from 1 to 40. This table is referred to as the t distribution table. Table 8.3 contains the rows
corresponding to degrees of freedom equal to 5, 10, 15, 20, and 25 selected from the t distribution
table in Appendix 3.


Table 8.3


Area in the right tail under the t distribution curve
df     .10     .05     .025    .01     .005    .001
5      1.476   2.015   2.571   3.365   4.032   5.893
10     1.372   1.812   2.228   2.764   3.169   4.144
15     1.341   1.753   2.131   2.602   2.947   3.733
20     1.325   1.725   2.086   2.528   2.845   3.552
25     1.316   1.708   2.060   2.485   2.787   3.450


EXAMPLE 8.5 The t distribution having 10 degrees of freedom is shown in Fig. 8-3. To find the shaded area
in the right-hand tail to the right of t = 1.812, locate the degrees of freedom, 10, under the df column in Table
8.3. The t value, 1.812, is located under the column labeled .05 in Table 8.3. The shaded area is equal to .05.
Since the total area under the curve is equal to 1, the area under this curve to the left of t = 1.812 is 1 − .05 =
.95. The area under this curve between t = −1.812 and t = 1.812 is 1 − .05 − .05 = .90, since there is .05 to the
right of t = 1.812 and .05 to the left of t = −1.812.
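
The tail areas read from the t distribution table can be checked by integrating the t density numerically. The following is a rough sketch using only the Python standard library. The function names are hypothetical, and Simpson's rule is used here as a convenient approximation, not as the method by which the table in Appendix 3 was produced.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def right_tail_area(t, df, steps=2000):
    """Area to the right of t (t >= 0): one half minus the area
    from 0 to t, found with Simpson's rule (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

right = right_tail_area(1.812, df=10)
print(round(right, 3))          # area to the right of t = 1.812
print(round(1 - right, 3))      # area to the left of t = 1.812
print(round(1 - 2 * right, 3))  # area between -1.812 and 1.812, by symmetry
```

The same function applies to any degrees of freedom, so it can also be used to confirm other entries of the t distribution table.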


Fig. 8-3
EXAMPLE 8.6 A t distribution curve with df = 20 is shown in Fig. 8-4. Suppose we wish to find the shaded
area under the curve between t = −2.528 and t = 2.086. From Table 8.3, the area to the right of t = 2.086 is .025,
and the area to the right of t = 2.528 is .01. By symmetry, the area to the left of t = −2.528 is also .01. Since the
total area under the curve is 1, the shaded area is 1 − .025 − .01 = .965.


Fig. 8-4


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES 


When a small sample (n < 30) is taken from a normally distributed population and the population
standard deviation, σ, is unknown, a confidence interval for the population mean, μ, is given by
formula (8.6):


\[ \bar{x} - t s_{\bar{x}} < \mu < \bar{x} + t s_{\bar{x}} \qquad (8.6) \]


In formula (8.6), \(\bar{x}\) is a point estimate of the population mean, \(s_{\bar{x}}\) is the estimated standard error,
and t is determined by the confidence level and the degrees of freedom. The degrees of freedom is
given by formula (8.7), where n is the sample size.


\[ df = n - 1 \qquad (8.7) \]


EXAMPLE 8.7 The distance traveled by automobile, in hundreds of miles, was determined for 20 individuals
returning from vacation. The results are given in Table 8.4.


Table 8.4 


For the data in Table 8.4, \(\bar{x}\) = 12.375, s = 3.741, and \(s_{\bar{x}}\) = 0.837. To find a 90% confidence interval for the
mean vacation travel distance of all such individuals, we need to find the value of t in formula (8.6). The
degrees of freedom is df = n − 1 = 20 − 1 = 19. Using df = 19 and the t distribution table, we must find the t
value for which the area under the curve between −t and t is .90. This means the area to the left of −t is .05 and
the area to the right of t is also .05. Table 8.5, which is selected from the t distribution table, indicates that the
area to the right of 1.729 is .05. This is the proper value of t for the 90% confidence level when df = 19.


Table 8.5


Area in the right tail under the t distribution curve
df     .10     .05     .025    .01     .005    .001
19     1.328   1.729   2.093   2.539   2.861   3.579
The following technique is also often recommended for finding the t value for a confidence
interval: subtract the confidence level from 1 and divide the answer by 2 to find the correct area in
the right-hand tail. In this example, the confidence level is 90% or .90. Subtracting .90 from 1, we
get .10. Dividing .10 by 2, we get .05 as the right-hand tail area. From Table 8.5, we see that the
proper t value is 1.729. Many students prefer this technique for finding the proper t value for
confidence intervals.

Using formula (8.6), the lower limit for the 90% confidence interval is \(\bar{x} - t s_{\bar{x}}\) = 12.375 − 1.729 ×
.837 = 10.928 and the upper limit is \(\bar{x} + t s_{\bar{x}}\) = 12.375 + 1.729 × .837 = 13.822. A 90% confidence
interval for μ extends from 1,093 to 1,382 miles. Vacation travel distances are assumed to be
normally distributed.
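
The small-sample interval of Example 8.7 can be sketched in Python as follows. The 20 distances are those entered in the Minitab session for this example, and the t value 1.729 is taken from Table 8.5 instead of being computed. The variable names are illustrative only.

```python
import math
from statistics import mean, stdev

# Distances in hundreds of miles (the data of Table 8.4).
distances = [
    12.5, 8.0, 16.0, 12.0, 9.5, 15.0, 5.0, 20.0, 13.0, 12.0, 10.0,
    15.0, 9.0, 16.0, 12.0, 15.5, 9.5, 17.5, 7.5, 12.5,
]
t_value = 1.729   # Table 8.5: df = 19, right-tail area .05

n = len(distances)
x_bar = mean(distances)
se = stdev(distances) / math.sqrt(n)   # stdev divides by n - 1
lower, upper = x_bar - t_value * se, x_bar + t_value * se

print(f"df = {n - 1}, mean = {x_bar:.3f}, se = {se:.3f}")
print(f"90% interval: ({lower:.3f}, {upper:.3f})")
```

Because the standard error is not rounded to .837 before multiplying, the third decimal of each limit can differ by one from the hand computation.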

To find a 90% confidence interval for μ using Minitab, use the set command to enter the data
given in Table 8.4. The command tinterval 90 percent confidence data in c1 is used to find the
confidence interval. The output is shown below. The confidence interval is (10.928, 13.822).


MTB > set c1
DATA > 12.5 8.0 16.0 12.0 9.5 15.0 5.0 20.0 13.0 12.0 10.0
DATA > 15.0 9.0 16.0 12.0 15.5 9.5 17.5 7.5 12.5
DATA > end
MTB > name c1 'distance'
MTB > tinterval 90 percent confidence data in c1


Confidence Intervals


Variable   N    Mean     StDev   SE Mean   90.0% C.I.
Distance   20   12.375   3.741   0.837     (10.928, 13.822)


EXAMPLE 8.8 The health-care costs, in thousands of dollars, for 20 males aged 75 or over are shown in
Table 8.6. Formula (8.6) may not be used to set a confidence interval on the mean health-care cost for such
individuals, since the sample data indicate that such health-care costs are not normally
distributed. The $515,000 and the $950,000 costs are far removed from the remaining health-care costs. This
indicates that the distribution of health-care costs for males aged 75 or over is skewed to the right and therefore
is not normal. Formula (8.6) is applicable only if it is reasonable to assume that the population
characteristic is normally distributed.


Table 8.6 


EXAMPLE 8.9 The number of square feet per mall devoted to children's apparel was determined for 15 malls
selected across the United States. The summary statistics for these 15 malls are as follows: \(\bar{x}\) = 2,700, s = 450,
and \(s_{\bar{x}}\) = 116.19. To determine a 99% confidence interval for μ, we need to find the value t where the area
under the t distribution curve between −t and t is .99 and the degrees of freedom = n − 1 = 15 − 1 = 14. Since the
area between −t and t is .99, the area to the left of −t is .005 and the area to the right of t is .005. Table 8.7 is
taken from the t distribution table and shows that the area to the right of 2.977 is .005. The lower limit for the
99% confidence interval is \(\bar{x} - t s_{\bar{x}}\) = 2,700 − 2.977 × 116.19 = 2,354 ft² and the upper limit is \(\bar{x} + t s_{\bar{x}}\) =
2,700 + 2.977 × 116.19 = 3,046 ft². The 99% confidence interval for μ is (2,354, 3,046).
Table 8.7


Area in the right tail under the t distribution curve
df     .10     .05     .025    .01     .005    .001
14     1.345   1.761   2.145   2.624   2.977   3.787


CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION: 
LARGE SAMPLES 


In Chapter 7, the variable shown in formula (7.10) and reproduced below was shown to have a
standard normal distribution provided np > 5 and nq > 5.


\[ z = \frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}} \qquad (7.10) \]


Since 95% of the area under the standard normal curve is between −1.96 and 1.96, and since the
variable in formula (7.10) has a standard normal distribution, we have the result shown below:


\[ P\left(-1.96 < \frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}} < 1.96\right) = .95 \qquad (8.8) \]


From Chapter 7, we also know that \(\mu_{\bar{p}} = p\) and \(\sigma_{\bar{p}} = \sqrt{pq/n}\). Substituting for \(\mu_{\bar{p}}\) and \(\sigma_{\bar{p}}\) in the
inequality \(-1.96 < (\bar{p} - \mu_{\bar{p}})/\sigma_{\bar{p}} < 1.96\) given in formula (8.8) and solving the resulting inequality
for p, we obtain


\[ \bar{p} - 1.96\sqrt{\frac{pq}{n}} < p < \bar{p} + 1.96\sqrt{\frac{pq}{n}} \qquad (8.9) \]


To obtain numerical values for the lower and upper limits of the confidence interval, it is necessary
to substitute the sample values \(\bar{p}\) and \(\bar{q}\) for p and q in the expression for \(\sigma_{\bar{p}}\). Making these
substitutions, we obtain formula (8.10).


\[ \bar{p} - 1.96\sqrt{\frac{\bar{p}\bar{q}}{n}} < p < \bar{p} + 1.96\sqrt{\frac{\bar{p}\bar{q}}{n}} \qquad (8.10) \]


The interval given in formula (8.10) is called a 95% confidence interval for p. The general form for
the interval is shown in formula (8.11), where z represents the proper value from the standard normal
distribution table as determined by the desired confidence level.


\[ \bar{p} - z\sqrt{\frac{\bar{p}\bar{q}}{n}} < p < \bar{p} + z\sqrt{\frac{\bar{p}\bar{q}}{n}} \qquad (8.11) \]


EXAMPLE 8.10 A study of 75 small-business owners determined that 80% got the money to start their
business from personal savings or credit cards. If p represents the proportion of all small-business owners who
got the money to start their business from personal savings or credit cards, then a point estimate of p is \(\bar{p}\) =
80%. The estimated standard error of the proportion is represented by \(s_{\bar{p}}\) and is given by
\(s_{\bar{p}} = \sqrt{\bar{p}\bar{q}/n} = \sqrt{80 \times 20/75}\) = 4.62%. From Table 8.1, the z value for a 99% confidence interval is 2.58.
The lower limit for a 99% confidence interval is \(\bar{p} - z\sqrt{\bar{p}\bar{q}/n}\) = 80 − 2.58 × 4.62 = 68.08% and the upper
limit is \(\bar{p} + z\sqrt{\bar{p}\bar{q}/n}\) = 80 + 2.58 × 4.62 = 91.92%. Formula (8.11) is considered to be valid if the sample
size satisfies the inequalities np > 5 and nq > 5. Since p and q are unknown, we check the sample size
requirement by checking to see if \(n\bar{p} > 5\) and \(n\bar{q} > 5\). In this case \(n\bar{p}\) = 75 × .8 = 60 and \(n\bar{q}\) = 75 × .2 = 15,
and since both \(n\bar{p}\) and \(n\bar{q}\) exceed 5, the sample size is large enough to assure the valid use of formula (8.11).
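
The interval of formula (8.11) and the accompanying sample-size check can be combined in one short function. This is a sketch under the assumptions of this section. The name proportion_interval is hypothetical, and the function works with proportions (.80) where the example uses percentages (80%).

```python
import math

def proportion_interval(p_bar, n, z):
    """Large-sample confidence interval for p based on the sample
    proportion p_bar. Raises ValueError unless n*p_bar and n*q_bar
    both exceed 5."""
    q_bar = 1 - p_bar
    if n * p_bar <= 5 or n * q_bar <= 5:
        raise ValueError("sample too small: need n*p_bar > 5 and n*q_bar > 5")
    se = math.sqrt(p_bar * q_bar / n)   # estimated standard error
    return p_bar - z * se, p_bar + z * se

# Example 8.10: 80% of 75 owners, 99% confidence (z = 2.58).
lower, upper = proportion_interval(p_bar=0.80, n=75, z=2.58)
print(f"{lower:.2%} to {upper:.2%}")
```

Calling the function with a sample proportion of .05 and n = 40 raises the error, since 40 × .05 = 2 does not exceed 5.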


EXAMPLE 8.11 In a random sample of 40 workplace homicides, it was found that 2 of the 40 were due to a
personal dispute. A point estimate of the population proportion of workplace homicides due to a personal
dispute is \(\bar{p}\) = 2/40 = .05 or 5%. Formula (8.11) may not be used to set a confidence interval on p, the population
proportion of homicides due to a personal dispute, since \(n\bar{p}\) = 40 × .05 = 2 is less than 5. Note that since p is
unknown, np cannot be computed. When p is unknown, \(n\bar{p}\) and \(n\bar{q}\) are computed to determine the
appropriateness of using formula (8.11).


DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION 
OF THE POPULATION MEAN 


The maximum error of estimate when using \(\bar{x}\) as a point estimate of μ is defined by formula (8.5)
and is restated below.


\[ E = z\sigma_{\bar{x}} \qquad (8.5) \]

If the maximum error of estimate is specified, then the sample size necessary to give this maximum
error may be determined from formula (8.5). Replacing \(\sigma_{\bar{x}}\) by \(\sigma/\sqrt{n}\) to obtain \(E = z\sigma/\sqrt{n}\) and then
solving for n, we obtain formula (8.12). The value obtained for n is always rounded up to the next
whole number. This is a conservative approach and sometimes results in a sample size larger than
actually needed.


\[ n = \frac{z^2\sigma^2}{E^2} \qquad (8.12) \]


EXAMPLE 8.12 In Example 8.4, the mean of a sample of 40 tuition costs for community colleges was found
to equal $1,396 and the standard deviation was $655. The margin of error associated with using $1,396 as a
point estimate of μ is equal to $204. To reduce the margin of error to $100, a larger sample is needed. The
required sample size is \(n = 1.96^2 \times 655^2 / 100^2 = 164.8\). To obtain a conservative estimate, we round the
estimate up to 165. Note that we estimated σ by s, because σ is unknown. The 40 tuition costs obtained in the
original survey would be supplemented by 165 − 40 = 125 additional community colleges. The new estimate
based on the 165 tuition costs would have a margin of error of $100.


To use formula (8.12) to determine the sample size, either σ or an estimate of σ is needed. In
Example 8.12, the original study, based on sample size 40, may be regarded as a pilot sample. When
a pilot sample is taken, the sample standard deviation is used in place of σ. In other instances, a
historical value of σ may exist. In some instances, the maximum and minimum value of the
characteristic being studied may be known and an estimate of σ may be obtained by dividing the
range by 4.
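
Formula (8.12), together with the round-up rule, can be written as a short function. This is a sketch only; the name sample_size_for_mean is hypothetical, and the value passed as sigma may be a known σ, a pilot-sample standard deviation, or the range divided by 4, as described above.

```python
import math

def sample_size_for_mean(z, sigma, max_error):
    """Sample size from formula (8.12), n = z^2 * sigma^2 / E^2,
    rounded up to the next whole number."""
    raw = (z * sigma / max_error) ** 2
    # Round to 6 places first so floating-point noise cannot push
    # an exact whole number up to the next integer.
    return math.ceil(round(raw, 6))

# Example 8.12: pilot sd of $655, desired margin of error $100.
n = sample_size_for_mean(z=1.96, sigma=655, max_error=100)
print(n, "colleges in total;", n - 40, "beyond the pilot sample of 40")
```

Halving the desired margin of error roughly quadruples the required sample size, because E enters the formula squared.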
EXAMPLE 8.13 A sociologist desires to estimate the mean age of teenagers having abortions. She wishes to
estimate μ with a 99% confidence interval so that the maximum error of estimate is E = .1 years. The z value is
2.58. The range of ages for teenagers is 19 − 13 = 6 years. A rough estimate of the standard deviation is obtained
by dividing the range by 4 to obtain 1.5 years. This method for estimating σ works best for mound-shaped
distributions. The approximate sample size is \(n = 2.58^2 \times 1.5^2 / .1^2 = 1497.7\). Rounding up, we obtain n = 1,498.


DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION 
OF THE POPULATION PROPORTION 


When \(\bar{p}\) is used as a point estimate of p, the maximum error of estimate is given by

\[ E = z\sigma_{\bar{p}} \qquad (8.13) \]

If the maximum error of estimate is specified, then the sample size necessary to give this maximum
error may be determined from formula (8.13). If \(\sigma_{\bar{p}}\) is replaced by \(\sqrt{pq/n}\) and the resultant equation is
solved for n, we obtain


\[ n = \frac{z^2 pq}{E^2} \qquad (8.14) \]

Since p and q are usually unknown, they must be estimated when formula (8.14) is used to
determine a sample size to give a specified maximum error of estimate. If a reasonable estimate of p
and q exists, then the estimate is used in the formula. If no reasonable estimate is known, then both p
and q are replaced by .5. This gives a conservative estimate for n. That is, replacing p and q by .5
usually gives a larger sample size than is needed, but it covers all cases.


EXAMPLE 8.14 A study is undertaken to obtain a precise estimate of the proportion of diabetics in the United
States. Estimates ranging from 2% to 5% are found in various publications. The sample size necessary to
estimate the population percentage to within 0.1% with 95% confidence is
\(n = 1.96^2 \times .05 \times .95 / .001^2 = 182{,}476\). When a range of possible values for p exists, as in this problem, use
the value closest to .5 as a reasonable estimate for p. In this example, the value in the range from .02 to .05
closest to .5 is .05. If the prior estimate of p were not used, and .5 is used for p and q, then the computed sample
size is \(n = 1.96^2 \times .5 \times .5 / .001^2 = 960{,}400\). Notice that using the prior information concerning p makes a
tremendous difference in this example.
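
The effect of a prior estimate on the required sample size is easy to explore in code. The sketch below implements formula (8.14) with proportions in place of percentages. The function name and the default p = .5 are illustrative choices, not part of the original outline.

```python
import math

def sample_size_for_proportion(z, max_error, p=0.5):
    """Sample size from formula (8.14), n = z^2 * p * q / E^2,
    rounded up. With no prior estimate, p defaults to the
    conservative value .5."""
    raw = z ** 2 * p * (1 - p) / max_error ** 2
    # Round first so floating-point noise does not push an exact
    # whole number up by one.
    return math.ceil(round(raw, 6))

# Example 8.14: estimate p to within .001 with 95% confidence.
print(sample_size_for_proportion(1.96, 0.001, p=0.05))  # using the prior estimate .05
print(sample_size_for_proportion(1.96, 0.001))          # no prior estimate, p = q = .5
```

Since the product pq is largest when p = .5, any prior estimate away from .5 can only reduce the computed sample size.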


Solved Problems 


POINT ESTIMATE 


8.1. The mean annual salary for public school teachers is $32,000. The mean salary for a sample of 
750 public school teachers equals $31,895. Identify the population, the population mean, and 
the sample mean. Identify the parameter and the point estimate of the parameter. 


Ans. The population consists of all public school teachers. μ = $32,000, \(\bar{x}\) = $31,895. The parameter is
μ. A point estimate of the parameter μ is $31,895.

8.2 Ninety-eight percent of U.S. homes have a TV. A survey of 2000 homes finds that 1,925 have
a TV. Identify the population, the population proportion, and the sample proportion. Identify
the parameter and a point estimate of the parameter.


Ans. The population consists of all U.S. homes. p = 98%, \(\bar{p}\) = 96.25%. The parameter is p. A point
estimate of p is 96.25%.


INTERVAL ESTIMATE 


8.3. When a multiple of the standard error of a point estimate is subtracted from the point estimate 
to obtain a lower limit and added to the point estimate to obtain an upper limit, what term is 


used to describe the numbers between the lower limit and the upper limit? 


Ans. interval estimate 


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES 


Table 8.8 


8.4 A sample of 50 taxpayers receiving tax refunds is shown in Table 8.8. Find an 80% confidence
interval for μ, where μ represents the mean refund for all taxpayers receiving a refund.


Ans. For the data in Table 8.8, Σx = 51,685, Σx² = 70,158,336, \(\bar{x}\) = 1033.7, s = 584.3, \(s_{\bar{x}}\) = 82.6.
From Table 8.1, the z value for 80% confidence is 1.28. Lower limit = 1033.7 − 1.28 × 82.6 =
927.97, upper limit = 1033.7 + 1.28 × 82.6 = 1139.43. The 80% confidence interval is (927.97,
1139.43).

8.5 Use Minitab to find the 80% confidence interval for μ in problem 8.4.


Ans. MTB > set c1
DATA > 1515 1432 120 1270 312 1904 1662 1857 903 1313 959
DATA > 723 1212 202 1118 1726 1562 1518 1671 395 1744 764
DATA > 868 1236 1631 681 1243 392 1623 169 589 1901 168
DATA > 1769 232 171 271 1062 1023 800 762 364 1396 668
DATA > 1781 313 1067 1367 283 1973
DATA > end
MTB > name c1 'refund'
MTB > standard deviation c1

Standard deviation of refund = 584.35
MTB > zinterval 80 percent confidence sigma = 584.35 data in c1


Confidence Intervals
The assumed sigma = 584


Variable   N    Mean     StDev   SE Mean   80.0% C.I.
Refund     50   1033.7   584.3   82.6      (927.8, 1139.6)
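
The Minitab limits (927.8, 1139.6) differ slightly from the hand-computed limits (927.97, 1139.43) of problem 8.4. One plausible explanation, offered here as an assumption about the software, is that the program uses an unrounded z value and standard error, whereas the hand computation uses z = 1.28 and 82.6. The sketch below compares the two using the summary statistics from the output; it relies on statistics.NormalDist from the Python standard library.

```python
import math
from statistics import NormalDist

x_bar, sigma, n = 1033.7, 584.35, 50
se = sigma / math.sqrt(n)               # unrounded standard error

z_table = 1.28                          # value from Table 8.1
z_exact = NormalDist().inv_cdf(0.90)    # leaves .10 in each tail

for label, z in (("table z", z_table), ("exact z", z_exact)):
    print(f"{label}: ({x_bar - z * se:.1f}, {x_bar + z * se:.1f})")
```

With the unrounded z value the limits round to 927.8 and 1139.6, in agreement with the output above.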


MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN 


8.6 A sample of size 50 is selected from a population having a standard deviation equal to 4. Find 
the maximum error of estimate associated with the confidence intervals having the following 
confidence levels: (a) 80%; (b) 90%; (c) 95%; (d) 99%. 

Ans. The standard error of the mean is \(\sigma/\sqrt{n} = 4/\sqrt{50} = .566\). Using the z values given in Table 8.1, the
following maximum errors of estimate are obtained.
(a) E = 1.28 × .566 = .724 (b) E = 1.65 × .566 = .934
(c) E = 1.96 × .566 = 1.109 (d) E = 2.58 × .566 = 1.460
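
The four computations in problem 8.6 follow the same pattern, so they can be generated in a loop. This is a minimal sketch; the dictionary Z_TABLE simply restates the z values quoted from Table 8.1, and the function name max_error is hypothetical.

```python
import math

# z values quoted from Table 8.1, keyed by confidence level in percent.
Z_TABLE = {80: 1.28, 90: 1.65, 95: 1.96, 99: 2.58}

def max_error(level, sigma, n):
    """Maximum error of estimate E = z * sigma / sqrt(n), formula (8.5)."""
    return Z_TABLE[level] * sigma / math.sqrt(n)

for level in Z_TABLE:
    print(f"{level}%: E = {max_error(level, sigma=4, n=50):.3f}")
```

Because the standard error is not rounded to .566 before multiplying, the last digit of some results can differ by one from the answers above.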

8.7 Describe the effect of the following on the maximum error of estimate when determining a 
confidence interval for the mean: (a) sample size; (b) confidence level; (c) variability of the 
characteristic being measured 
Ans. (a) The maximum error of estimate decreases when the sample size is increased. 

(b) The maximum error of estimate increases if the confidence level is increased. 
(c) The larger the variability of the characteristic being measured, the larger the maximum error 
of estimate. 

THE t DISTRIBUTION 

8.8 Table 8.9 contains the row corresponding to 7 degrees of freedom taken from the t distribution
table in Appendix 3. For a t distribution curve with df = 7, find the following:

(a) Area under the curve to the right of t = 2.998
(b) Area under the curve to the left of t = −2.998
(c) Area under the curve between t = −2.998 and t = 2.998

Table 8.9

Area in the right tail under the t distribution curve
df     .10     .05     .025    .01     .005    .001
7      1.415   1.895   2.365   2.998   3.499   4.785

Ans. (a) .01
(b) Since the curve is symmetrical about 0, the answer is the same as in part (a), .01
(c) Since the total area under the curve is 1, the answer is 1 − .01 − .01 = .98

8.9 Refer to problem 8.8 and Table 8.9 to find the following.


(a) Find the t value for which the area under the curve to the right of t is .05.

(b) Find the t value for which the area under the curve to the left of t is .05.

(c) By considering the answers to parts (a) and (b), find that positive t value for which the
area between −t and t is equal to .90.

Ans. (a) 1.895 (b) −1.895 (c) 1.895


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES 


8.10 In a transportation study, 20 cities were randomly selected from all cities having a population
of 50,000 or more. The number of cars per 1,000 people was determined for each selected city,
and the results are shown in Table 8.10. Find a 95% confidence interval for μ, where μ is the
mean number of cars per 1,000 people for all cities having a population of 50,000 or more.
What assumption is necessary for the confidence interval to be valid?


Table 8.10


Ans. For the data in Table 8.10, Σx = 10,092, Σx² = 5,381,738, \(\bar{x}\) = 504.6, s = 123.4, \(s_{\bar{x}}\) = 27.6. The
t distribution is used because the sample is small. The degrees of freedom is df = 20 − 1 = 19. The
t value in the confidence interval is determined from Table 8.11, which is taken from the t
distribution table in Appendix 3. From Table 8.11, we see that the area to the right of 2.093 is .025
and the area to the left of −2.093 is .025, and therefore the area between −2.093 and 2.093 is .95.
The lower limit of the 95% confidence interval is \(\bar{x} - t s_{\bar{x}}\) = 504.6 − 2.093 × 27.6 = 446.8 and the
upper limit is \(\bar{x} + t s_{\bar{x}}\) = 504.6 + 2.093 × 27.6 = 562.4. The 95% confidence interval is (446.8,
562.4). It is assumed that the number of cars per 1,000 people in cities of
50,000 or over is normally distributed.


Table 8.11


Area in the right tail under the t distribution curve
df     .10     .05     .025    .01     .005    .001
19     1.328   1.729   2.093   2.539   2.861   3.579


8.11 Find the Minitab solution to problem 8.10. 


Ans. MTB > set c1
DATA > 409 663 304 535 628 487 676 565 554 308 480
DATA > 494 670 515 319 535 332 434 665 519
DATA > end
MTB > name c1 'cars'
MTB > tinterval 95 percent confidence data in c1
Confidence Intervals
Variable   N    Mean    StDev   SE Mean   95.0% C.I.
Cars       20   504.6   123.4   27.6      (446.8, 562.4)


CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION: 
LARGE SAMPLES 


8.12 A national survey of 1200 adults found that 450 of those surveyed were pro-choice on the 
abortion issue. Find a 95% confidence interval for p, the proportion of all adults who are pro- 
choice on the abortion issue. 
Ans. The sample proportion who are pro-choice is \(\bar{p}\) = 450/1200 = .375 and the proportion who are not
pro-choice or are undecided is \(\bar{q}\) = 750/1200 = .625. The estimated standard error of the proportion is
\(\sqrt{\bar{p}\bar{q}/n} = \sqrt{.375 \times .625/1200} = .014\). The z value is 1.96. The lower limit of the 95% confidence
interval is \(\bar{p} - z\sqrt{\bar{p}\bar{q}/n}\) = .375 − 1.96 × .014 = .348 and the upper limit is \(\bar{p} + z\sqrt{\bar{p}\bar{q}/n}\) = .375 +
1.96 × .014 = .402. The 95% confidence interval is (.348, .402).


8.13 A national survey of 500 African-Americans found that 62% in the survey favored affirmative
action programs. Find a 90% confidence interval for p, the proportion of all African-Americans
who favor affirmative action programs.


Ans. The sample percent who favor affirmative action programs is 62%. The sample percent who do not
favor affirmative action programs or are undecided is 38%. The estimated standard error of the
proportion, expressed as a percentage, is \(\sqrt{\bar{p}\bar{q}/n} = \sqrt{62 \times 38/500} = 2.17\%\). The z value is 1.65. The
lower limit of the 90% confidence interval is \(\bar{p} - z\sqrt{\bar{p}\bar{q}/n}\) = 62 − 1.65 × 2.17 = 58.4% and the
upper limit is \(\bar{p} + z\sqrt{\bar{p}\bar{q}/n}\) = 62 + 1.65 × 2.17 = 65.6%. The 90% confidence interval is (58.4%,
65.6%).


DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION 
OF THE POPULATION MEAN 


8.14 A machine fills containers with corn meal. The machine is set to put 680 grams in each
container on the average. The standard deviation is equal to 0.5 gram. The average fill is
known to shift from time to time. However, the variability remains constant. That is, σ remains
constant at 0.5 gram. In order to estimate μ, how many containers should be selected from a
large production run so that the maximum error of estimate equals 0.2 gram with probability
0.95?


Ans. The sample size is determined by use of the formula \(n = z^2\sigma^2/E^2\). The value for σ is known to equal
0.5, E is specified to be 0.2, and for probability equal to 0.95, the z value is 1.96. Therefore, the
sample size is \(n = z^2\sigma^2/E^2 = 1.96^2 \times .5^2 / .2^2 = 24.01\). The required sample size is obtained by rounding
the answer up to 25.


8.15 A pilot study of 250 individuals found that the mean annual health-care cost per person was
$2550 and the standard deviation was $1350. How large a sample is needed to estimate the true
annual health-care cost with a maximum error of estimate equal to $100 with probability equal
to 0.99?


Ans. The sample size is determined by use of the formula \(n = z^2\sigma^2/E^2\). The value for σ is estimated from
the pilot study to be $1350. E is specified to be $100 and for probability equal to 0.99, the z value
is 2.58. Therefore, the sample size is \(n = z^2\sigma^2/E^2 = 2.58^2 \times 1350^2 / 100^2 = 1213.13\). The required sample
size is obtained by rounding the answer up to 1214. If the results from the pilot study are used,
then an additional 1214 − 250 = 964 individuals need to be surveyed.


DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION 
OF THE POPULATION PROPORTION 


8.16 A large study is undertaken to estimate the percentage of students in grades 9 through 12 who
use cigarettes. Other studies have indicated that the percent ranges between 30% and 35%.
How large a sample is needed in order to estimate the true percentage for all such students with
a maximum error of estimate equal to 0.5% with a probability of 0.90?


Ans. The sample size is given by \(n = z^2 pq/E^2\). The specified value for E is 0.5%. The z value for
probability equal to 0.90 is 1.65. The previous studies indicate that p is between 30% and 35%.
When a range of likely values for p exists, the value closest to 50% is used. This value gives a
conservative estimate of the sample size. The sample size is \(n = 1.65^2 \times 35 \times 65 / .5^2 = 24{,}774.75\). The
sample size is 24,775.


8.17 Find the sample size in problem 8.16 if no prior estimate of the population proportion exists.


Ans. The sample size is given by \(n = z^2 pq/E^2\). The specified value for E is 0.5%. The z value for
probability equal to 0.90 is 1.65. When no prior estimate for p exists, p and q are set equal to 50%
in the sample size formula. Therefore, \(n = 1.65^2 \times 50 \times 50 / .5^2 = 27{,}225\). Note that 2,450 fewer subjects are
needed when the prior estimate of 35% is used.

Supplementary Problems 
POINT ESTIMATE 


8.18 Table 8.8 in problem 8.4 contains a sample of 50 tax refunds received by taxpayers. Give point estimates
for the population mean, population median, and the population standard deviation.


Ans. $1033.70, the mean of the 50 tax refunds, is a point estimate of the population mean.
$1064.50, the median of the 50 tax refunds, is a point estimate of the population median.
$584.30, the standard deviation of the 50 tax refunds, is a point estimate of the population standard
deviation.


8.19 Table 8.10 in problem 8.10 contains a sample of the number of cars per 1,000 people for 20 cities having
a population of 50,000 or more. Give a point estimate of the proportion of all such cities with 600 or
more cars per 1,000 people.


Ans. In this sample, there are 5 cities with 600 or more cars per 1,000. The point estimate of p is .25 or
25%.
INTERVAL ESTIMATE


8.20 What happens to the width of an interval estimate when the sample size is increased?


Ans. The width of an interval estimate decreases when the sample size is increased.


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES 


8.21 One hundred subjects in a psychological study had a mean score of 35 on a test instrument designed to
measure anger. The standard deviation of the 100 test scores was 10. Find a 99% confidence interval for
the mean anger score of the population from which the sample was selected.


Ans. The confidence interval is \(\bar{x} - z\sigma_{\bar{x}} < \mu < \bar{x} + z\sigma_{\bar{x}}\).
The lower limit is \(\bar{x} - z\sigma_{\bar{x}}\) = 35 − 2.58 × 1 = 32.42 and the upper limit is \(\bar{x} + z\sigma_{\bar{x}}\) = 35 + 2.58 × 1 =
37.58. The interval is (32.42, 37.58).


8.22 The width of a large sample confidence interval is equal to \(w = (\bar{x} + z\sigma_{\bar{x}}) - (\bar{x} - z\sigma_{\bar{x}}) = 2z\sigma_{\bar{x}}\). Replacing
\(\sigma_{\bar{x}}\) by \(\sigma/\sqrt{n}\), w is given as \(w = 2z\sigma/\sqrt{n}\). Discuss the effect of the confidence level, the standard deviation,
and the sample size on the width.


Ans. The width varies directly as the confidence level, directly as the variability of the characteristic
being measured, and inversely as the square root of the sample size.


MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN 


8.23 The U.S. abortion rate per 1,000 female residents in the age group 15 to 44 was determined for 35
different cities having a population of 25,000 or more across the U.S. The mean for the 35 cities was
equal to 24.5 and the standard deviation was equal to 4.5. What is the maximum error of estimate when a
90% confidence interval is constructed for μ?
Ans. The maximum error of estimate is \(E = z\sigma_{\bar{x}}\). For a 90% confidence interval, z = 1.65. The estimated
standard error is 0.76. E = 1.65 × .76 = 1.25.
8.24 A study concerning one-way commuting times to work was conducted in Omaha, Nebraska. The mean
commuting time for 300 randomly selected workers was 45 minutes. The margin of error for the study
was 7 minutes. What is the 95% confidence interval for μ?
Ans. The 95% confidence interval is \(\bar{x} - 1.96\sigma_{\bar{x}} < \mu < \bar{x} + 1.96\sigma_{\bar{x}}\). The margin of error is \(E = 1.96\sigma_{\bar{x}}\)
= 7 and therefore, the lower limit of the interval is 38 and the upper limit is 52.
THE t DISTRIBUTION
8.25 Use the t distribution table and the standard normal distribution table to find the values for a, b, c, and d.
What distribution does the t distribution approach as the degrees of freedom increases?


Table 8.12


Distribution type            Area under the curve to the right of this value is .025
t distribution, df = 10      a
t distribution, df = 20      b
t distribution, df = 30      c
standard normal              d

Ans. a = 2.228 b = 2.086 c = 2.042 d = 1.96
The t distribution approaches the standard normal distribution when the degrees of freedom are
increased. This is illustrated by the results in Table 8.12.


The standard deviation of the t distribution is equal to \(\sqrt{df/(df - 2)}\) for df > 2. Find the standard deviation 
for the t distributions having df values equal to 5, 10, 20, 30, 50, and 100. What value is the standard 
deviation getting close to when the degrees of freedom get larger? 


Ans. The standard deviations are shown in Table 8.13. The standard deviation is approaching one as the 
degrees of freedom get larger. The t distribution approaches the standard normal distribution when 
the degrees of freedom are increased. 


Table 8.13 


df      Standard deviation of the t distribution with this value for df 
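The entries of this table follow directly from the formula \(\sqrt{df/(df - 2)}\), so they can be recomputed with a few lines of Python. This is a minimal sketch; the function name `t_standard_deviation` is an illustrative choice, not something defined in the text.

```python
from math import sqrt


def t_standard_deviation(df):
    """Standard deviation of the t distribution, valid only for df > 2."""
    if df <= 2:
        raise ValueError("the formula sqrt(df / (df - 2)) requires df > 2")
    return sqrt(df / (df - 2))


# Degrees of freedom requested in the problem statement.
for df in (5, 10, 20, 30, 50, 100):
    print(f"{df:>3}   {t_standard_deviation(df):.3f}")
```

The printed values decrease toward 1, the standard deviation of the standard normal distribution.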


CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES 


8.27 In order to estimate the lifetime of a new bulb, ten were tested and the mean lifetime was equal to 835 
hours with a standard deviation equal to 45 hours. A stem-and-leaf of the lifetimes indicated that it was 
reasonable to assume that lifetimes were normally distributed. Determine a 99% confidence interval for 
\(\mu\), where \(\mu\) represents the mean lifetime of the population of lifetimes. 


Ans. The confidence interval is \(\bar{x} - t s_{\bar{x}} < \mu < \bar{x} + t s_{\bar{x}}\), where \(\bar{x} = 835\), s = 45, \(s_{\bar{x}} = 45/\sqrt{10} = 14.23\), and t = 
3.250. The lower limit is \(\bar{x} - t s_{\bar{x}} = 835 - 46.25 = 788.75\) hours and the upper limit is \(\bar{x} + t s_{\bar{x}} = 
835 + 46.25 = 881.25\) hours. 


8.28 The heights of 15 randomly selected buildings in Chicago are given in Table 8.14. Find a 90% 
confidence interval on the mean height of buildings in Chicago. 


Table 8.14 


Ans. A Minitab histogram of the data in Table 8.14 is shown in Fig. 8-5. Since the heights are not 
normally distributed, the confidence interval using the t distribution is not appropriate. 
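For a case where the normality assumption is reasonable, such as problem 8.27, the small-sample interval can be sketched in Python as follows. The function name `t_confidence_interval` is hypothetical, and the critical value 3.250 (df = 9, area .005 in each tail) is entered by hand from the t table because Python's standard library does not provide the t distribution.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_critical):
    """Small-sample confidence interval for a mean, given a tabled t value."""
    std_error = s / sqrt(n)           # estimated standard error of the mean
    margin = t_critical * std_error
    return mean - margin, mean + margin


# Problem 8.27: ten bulbs, mean 835 hours, s = 45 hours, t = 3.250 from the table.
lower, upper = t_confidence_interval(mean=835, s=45, n=10, t_critical=3.250)
print(f"({lower:.2f}, {upper:.2f})")  # prints (788.75, 881.25)
```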
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION: LARGE SAMPLES 


8.29 In a survey of 900 adults, 360 responded “yes” to the question “Have you attended a major league 
baseball game in the last year?” Determine a 90% confidence interval for p, the proportion of all adults 
who attended a major league baseball game in the last year. 


Ans. The sample proportion attending a game in the last year is \(\bar{p} = .40\) and the proportion not 
attending is \(\bar{q} = .60\). The z value for a 90% confidence interval is 1.65. The confidence interval 
for p is \(\bar{p} - z\sqrt{\bar{p}\bar{q}/n} < p < \bar{p} + z\sqrt{\bar{p}\bar{q}/n}\). The maximum error of estimate is \(E = z\sqrt{\bar{p}\bar{q}/n} = .027\). The 
lower limit of the confidence interval is .40 - .027 = .373 and the upper limit is .40 + .027 = .427. 
The 90% interval is (.373, .427). 


8.30 Use the data in Table 8.8, found in problem 8.4, to find a 99% confidence interval for the proportion of 
all taxpayers receiving a refund who receive a refund of more than $500. 


Ans. Thirty-seven of the fifty refunds in Table 8.8 exceed $500. The sample proportion exceeding $500 
is .74. The z value for a 99% confidence interval is 2.58. The confidence interval for p is 
\(\bar{p} - z\sqrt{\bar{p}\bar{q}/n} < p < \bar{p} + z\sqrt{\bar{p}\bar{q}/n}\). The maximum error of estimate is \(E = z\sqrt{\bar{p}\bar{q}/n} = .16\). The lower 
limit is .74 - .16 = .58 and the upper limit is .74 + .16 = .90. The 99% interval is (.58, .90). 
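The two proportion intervals above (problems 8.29 and 8.30) can be reproduced with a small Python sketch. The name `proportion_interval` is an assumption made for illustration, and the z values are the table values quoted in the answers.

```python
from math import sqrt


def proportion_interval(successes, n, z):
    """Large-sample confidence interval for a population proportion."""
    p_bar = successes / n                 # sample proportion
    q_bar = 1 - p_bar
    margin = z * sqrt(p_bar * q_bar / n)  # maximum error of estimate E
    return p_bar - margin, p_bar + margin


# Problem 8.29: 360 of 900 adults, 90% confidence (z = 1.65).
lo_829, hi_829 = proportion_interval(360, 900, 1.65)
print(f"({lo_829:.3f}, {hi_829:.3f})")

# Problem 8.30: 37 of 50 refunds exceed $500, 99% confidence (z = 2.58).
lo_830, hi_830 = proportion_interval(37, 50, 2.58)
print(f"({lo_830:.2f}, {hi_830:.2f})")
```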


DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION OF THE POPULATION MEAN 


8.31 The estimated standard deviation of commuting distances for workers in a large city is determined to be 3 
miles in a pilot study. How large a sample is needed to estimate the mean commuting distance of all 
workers in the city to within .5 mile with 95% confidence? 


Ans. The sample size is given by \(n = z^2\sigma^2/E^2\). For this problem, E = .5, z = 1.96, and \(\sigma\) is estimated to be 
3. The sample size is \(1.96^2 \times 3^2/.5^2 = 138.3\), which is rounded up to 139. 


8.32 In problem 8.31, what size sample is required to estimate \(\mu\) to within .1 mile with 95% confidence? 


Ans. n = 3,458 
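The sample sizes in problems 8.31 and 8.32 follow from \(n = z^2\sigma^2/E^2\), always rounded up to the next whole number. A minimal Python sketch, with the hypothetical helper name `sample_size_for_mean`:

```python
from math import ceil


def sample_size_for_mean(sigma, error, z):
    """Smallest n with z * sigma / sqrt(n) <= error, i.e. n >= (z*sigma/error)^2."""
    return ceil((z * sigma / error) ** 2)


print(sample_size_for_mean(sigma=3, error=0.5, z=1.96))  # prints 139
print(sample_size_for_mean(sigma=3, error=0.1, z=1.96))  # prints 3458
```

Reducing the error to one fifth of its value multiplies the required sample size by about 25, because n varies inversely with the square of E.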
DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION OF THE 
POPULATION PROPORTION 


8.33 What sample size is required to estimate the proportion of adults who wear a beeper to within 3% with 
probability 0.95? A survey taken one year ago indicated that 15% of all adults wore a beeper. 


Ans. The sample size is given by \(n = z^2 pq/E^2\). The error of estimate, E, is given to be 3%, and the z value 
is 1.96. If we assume that the current proportion is “close” to 15%, then \(n = (1.96^2 \times 15 \times 85)/3^2 = 
544.2\), which is rounded up to 545. 


8.34 Suppose in problem 8.33 that there is no previous estimate of the population proportion. What sample 
size would be required to estimate p to within 3% with probability 0.95? 


Ans. The sample size is given by \(n = z^2 pq/E^2\). The error of estimate, E, is given to be 3%, and the z value 
is 1.96. With no previous estimate for p, use p = q = 50%. \(n = (1.96^2 \times 50 \times 50)/3^2 = 1{,}067.1\), which is 
rounded up to 1,068. 
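Both answers can be checked with the same formula, \(n = z^2 pq/E^2\). The sketch below uses a hypothetical helper, `sample_size_for_proportion`, that works with proportions rather than percentages and defaults to p = .5 when no prior estimate is available, which is the most conservative choice because pq is largest there.

```python
from math import ceil


def sample_size_for_proportion(error, z, p=0.5):
    """Sample size needed to estimate a proportion to within `error`."""
    q = 1 - p
    return ceil(z ** 2 * p * q / error ** 2)


print(sample_size_for_proportion(error=0.03, z=1.96, p=0.15))  # prints 545
print(sample_size_for_proportion(error=0.03, z=1.96))          # prints 1068
```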


Chapter 9 


Tests of Hypotheses: One Population 


NULL HYPOTHESIS AND ALTERNATIVE HYPOTHESIS 


There are a number of mail order companies in the U.S. Consider such a company called L. C. 
Stephens Inc. The mean sales per order for L. C. Stephens, determined from a large database last 
year, is known to equal $155. From the same database, the standard deviation is determined to be \(\sigma = \$50\). 
The company wishes to determine whether the mean sales per order for the current year is 
different from the historical mean of $155. If \(\mu\) represents the mean sales per order for the current 
year, then the company is interested in determining if \(\mu \neq \$155\). The alternative hypothesis, also 
often called the research hypothesis, is related to the purpose of the research. In this instance, the 
purpose of the research is to determine whether \(\mu \neq \$155\). The research hypothesis is represented 
symbolically by \(H_a: \mu \neq \$155\). The negation of the research hypothesis is called the null hypothesis. 
In this case, the negation of the research hypothesis is that \(\mu = \$155\). The null hypothesis is 
represented symbolically by \(H_0: \mu = \$155\). The company wishes to conduct a test of hypothesis to 
determine which of the two hypotheses is true. A test of hypothesis where the research hypothesis is 
of the form \(H_a: \mu \neq\) (some constant) is called a two-tailed test. The reason for this term will be clear 
when the details of the test procedure are discussed. 


EXAMPLE 9.1 A tire manufacturer claims that the mean mileage for their premium brand tire is 60,000 miles. 
A consumer organization doubts the claim and decides to test it. That is, the purpose of the research from the 
perspective of the consumer organization is to test if \(\mu < 60{,}000\), where \(\mu\) represents the mean mileage of all 
such tires. The research hypothesis is stated symbolically as \(H_a: \mu < 60{,}000\) miles. The negation of the research 
hypothesis is \(\mu = 60{,}000\). The null hypothesis is stated symbolically as \(H_0: \mu = 60{,}000\). A test of hypothesis 
where the research hypothesis is of the form \(H_a: \mu <\) (some constant) is called a one-tailed test, a left-tailed test, 
or a lower-tailed test. 


EXAMPLE 9.2 The National Football League (NFL) claims that the average cost for a family of four to attend 
an NFL game is $200. This figure includes ticket prices and snacks. A sports magazine feels that this figure is 
too low and plans to perform a test of hypothesis concerning the claim. The research hypothesis is \(H_a: \mu > \$200\), 
where \(\mu\) is the mean cost for all such families attending an NFL game. The null hypothesis is \(H_0: \mu = \$200\). A 
test of hypothesis where the research hypothesis is of the form \(H_a: \mu >\) (some constant) is called a one-tailed 
test, a right-tailed test, or an upper-tailed test. 


TEST STATISTIC, CRITICAL VALUES, REJECTION, AND 
NONREJECTION REGIONS 


Consider again the test of hypothesis to be performed by the L. C. Stephens mail order company 
discussed in the previous section. The two hypotheses are restated as follows: 


\(H_0: \mu = \$155\) (the mean sales per order this year is $155) 
\(H_a: \mu \neq \$155\) (the mean sales per order this year is not $155) 


To test this hypothesis, a sample of 100 orders for the current year is selected and the mean of the 
sample is used to decide whether to reject the null hypothesis or not. A statistic that is used to 
decide whether to reject the null hypothesis or not is called a test statistic. The central limit theorem 
assures us that if the null hypothesis is true, then \(\bar{x}\) will equal $155 on the average and the standard 
error will equal \(\sigma_{\bar{x}} = 50/\sqrt{100} = 5\), assuming that the standard deviation equals 50 and has remained 
constant. Furthermore, the sample mean has a normal distribution. If the value of the test statistic, \(\bar{x}\), 
is “close” to $155, then we would likely not reject the null hypothesis. However, if the computed 
value of \(\bar{x}\) is considerably different from $155, we would reject the null hypothesis since this 
outcome supports the truth of the research hypothesis. The critical decision is how different \(\bar{x}\) 
needs to be from $155 in order to reject the null hypothesis. Suppose we decide to reject the null 
hypothesis if \(\bar{x}\) differs from $155 by two standard errors or more. Two standard errors is equal to 
\(2 \times \sigma_{\bar{x}} = \$10\). Since the distribution of the sample mean is bell-shaped, the empirical rule assures us that 
the probability that \(\bar{x}\) differs from $155 by two standard errors or more is approximately equal to 
0.05. That is, if \(\bar{x} < \$145\) or if \(\bar{x} > \$165\), then we will reject the null hypothesis. Otherwise, we will 
not reject the null hypothesis. The values $145 and $165 are called critical values. The critical 
values divide the possible values of the test statistic into two regions called the rejection region and 
the nonrejection region. The critical values along with the regions are shown in Fig. 9-1. 


rejection region | nonrejection region | rejection region 
                145                   165 


Fig. 9-1 


EXAMPLE 9.3 In Example 9.1, the null and research hypotheses are stated as follows: 


\(H_0: \mu = 60{,}000\) miles (the tire manufacturer’s claim is correct) 
\(H_a: \mu < 60{,}000\) miles (the tire manufacturer’s claim is false) 


Suppose it is known that the standard deviation of tire mileages for this type of tire is 7,000 miles. If a sample of 
49 of the tires is road tested for mileage and the manufacturer’s claim is correct, then \(\bar{x}\) will equal 60,000 
miles on the average and the standard error will equal \(\sigma_{\bar{x}} = 7{,}000/\sqrt{49} = 1{,}000\) miles. Furthermore, because of the 
large sample size, the sample mean will have a normal distribution. Suppose the critical value is chosen two 
standard errors below 60,000; then the rejection and nonrejection regions will be as shown in Fig. 9-2. 


rejection region | nonrejection region      \(\bar{x}\) 
              58,000 


Fig. 9-2 


When the consumer organization tests the 49 tires, if the mean mileage is less than 58,000 miles for the sample, 
then the manufacturer’s claim will be rejected. 


EXAMPLE 9.4 In Example 9.2, the null hypothesis and the research hypothesis are stated as follows: 


\(H_0: \mu = \$200\) (the NFL claim is correct) 
\(H_a: \mu > \$200\) (the NFL claim is not correct) 


Suppose the standard deviation of the cost for a family of four to attend an NFL game is known to equal $30. If 
a sample of 36 costs for families of size 4 is obtained, then \(\bar{x}\) will be normally distributed with mean equal to 
$200 and standard error equal to \(30/\sqrt{36} = \$5\) if the NFL claim is correct. Suppose the critical value is chosen 
three standard errors above $200; then the rejection and nonrejection regions are as shown in Fig. 9-3. 


nonrejection region | rejection region 
                   215 


Fig. 9-3 


If the mean of the sample of costs for the 36 families of size 4 exceeds $215, then the NFL claim will be 
rejected. 


TYPE I AND TYPE II ERRORS 


Consider again the test of hypothesis to be performed by the L. C. Stephens mail order company 
discussed in the previous two sections. The two hypotheses are restated as follows: 


\(H_0: \mu = \$155\) (the mean sales per order this year is $155) 
\(H_a: \mu \neq \$155\) (the mean sales per order this year is not $155) 


The decision to reject or not reject the null hypothesis is to be based on the sample mean. The critical 
values and the rejection and nonrejection regions are given in Fig. 9-1. If the current year mean, \(\mu\), is 
$155 but the sample mean falls in the rejection region, resulting in the null hypothesis being rejected, 
then a Type I error is made. If the current year mean is not equal to $155, but the sample mean falls 
in the nonrejection region, resulting in the null hypothesis not being rejected, then a Type II error is 
made. That is, if the null hypothesis is true but the statistical test results in rejection of the null, then 
a Type I error occurs. If the null hypothesis is false but the statistical test results in not rejecting the 
null, then a Type II error occurs. The errors as well as the possible correct conclusions are 
summarized in Table 9.1. 


Table 9.1 


Conclusion                 \(H_0\) true             \(H_0\) false 
Do not reject \(H_0\)      Correct conclusion       Type II error 
Reject \(H_0\)             Type I error             Correct conclusion 


The first two letters of the Greek alphabet are used in statistics to represent the probabilities of 
committing the two types of errors. The definitions are as follows. 


\(\alpha\) = probability of making a Type I error 
\(\beta\) = probability of making a Type II error 


The calculation of \(\alpha\) will now be illustrated, and the calculation of \(\beta\) will be 
illustrated in a later section. The term level of significance is also used for \(\alpha\). 


The level of significance, \(\alpha\), is defined by formula (9.1): 
\(\alpha\) = P(rejecting the null hypothesis when the null hypothesis is true) (9.1) 


In the current discussion of the L. C. Stephens mail order company, the level of significance may be 
expressed as \(\alpha = P(\bar{x} < 145 \text{ or } \bar{x} > 165 \text{ when } \mu = 155)\). In order to evaluate this probability, recall that 
for large samples, \(z = (\bar{x} - \mu)/\sigma_{\bar{x}}\) has a standard normal distribution. As previously discussed, the standard 
error is $5 and assuming that the null hypothesis is true, \(z = (\bar{x} - 155)/5\) has a standard normal 
distribution. The event \(\bar{x} < 145\) is equivalent to the event \((\bar{x} - 155)/5 < (145 - 155)/5 = -2\) or z < -2. The 
event \(\bar{x} > 165\) is equivalent to the event \((\bar{x} - 155)/5 > (165 - 155)/5 = 2\) or z > 2. Therefore, \(\alpha\) may be 
expressed as \(\alpha = P(\bar{x} < 145) + P(\bar{x} > 165) = P(z < -2) + P(z > 2)\). From the standard normal 
distribution table, P(z < -2) = P(z > 2) = .5 - .4772 = .0228. And therefore, \(\alpha = 2 \times .0228 = .0456\). 
When using the mean of a sample of 100 order sales and the rejection and nonrejection regions 
shown in Fig. 9-1 to decide whether the overall mean sales per order has changed, 4.56% of the time 
the company will conclude that the mean has changed when, in fact, it has not. 


EXAMPLE 9.5 In Example 9.3, the null and research hypotheses for the tire manufacturer’s claim are as 
follows: 


\(H_0: \mu = 60{,}000\) miles (the tire manufacturer’s claim is correct) 
\(H_a: \mu < 60{,}000\) miles (the tire manufacturer’s claim is false) 


The rejection and nonrejection regions were illustrated in Fig. 9-2. The level of significance is determined as 
follows: 


\(\alpha\) = P(rejecting the null hypothesis when the null hypothesis is true) 

Since the null is rejected when the sample mean is less than 58,000 miles and the null is true when \(\mu = 60{,}000\), 

\(\alpha = P(\bar{x} < 58{,}000 \text{ when } \mu = 60{,}000)\) 

Transforming the sample mean to a standard normal, the expression for \(\alpha\) now becomes 

\[\alpha = P\left(\frac{\bar{x} - 60{,}000}{1{,}000} < \frac{58{,}000 - 60{,}000}{1{,}000}\right)\] 

Simplifying, and finding the area to the left of -2 under the standard normal curve, we find 

\(\alpha = P(z < -2) = .5 - .4772 = .0228\) 

If the tire manufacturer’s claim is tested by determining the mileages for 49 tires and rejecting the claim when 
the mean mileage for the sample is less than 58,000 miles, then the probability of rejecting the manufacturer’s 
claim, when it is correct, equals .0228. 


EXAMPLE 9.6 In Example 9.4, the null and research hypotheses for the NFL’s claim are as follows: 


\(H_0: \mu = \$200\) (the NFL claim is correct) 
\(H_a: \mu > \$200\) (the NFL claim is not correct) 


The rejection and nonrejection regions were illustrated in Fig. 9-3. The level of significance is determined as 
follows: 


\(\alpha\) = P(rejecting the null hypothesis when the null hypothesis is true) 

Since the null is rejected when the sample mean is greater than $215 and the null is true when \(\mu = \$200\), 

\(\alpha = P(\bar{x} > 215 \text{ when } \mu = 200)\) 

Transforming the sample mean to a standard normal, the expression for \(\alpha\) now becomes 

\[\alpha = P\left(\frac{\bar{x} - 200}{5} > \frac{215 - 200}{5}\right)\] 

Simplifying, and finding the area to the right of 3 under the standard normal curve, we find 

\(\alpha = P(z > 3) = .5 - .4986 = .0014\) 


The probability of falsely rejecting the NFL’s claim when using a sample of 36 families and the rejection and 
nonrejection regions shown in Fig. 9-3 to test the hypothesis is .0014. 
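The three levels of significance found so far (the mail order test and Examples 9.5 and 9.6) can be reproduced numerically. The sketch below is illustrative only: `significance_level` is a hypothetical helper that places a normal sampling distribution at the null value and adds up the area in the rejection region. Because it uses Python's `statistics.NormalDist` rather than a four-place table, the results (about .0455, .0228, and .0013) can differ from the tabled answers in the last digit.

```python
from statistics import NormalDist


def significance_level(mu0, std_error, lower=None, upper=None):
    """P(sample mean falls in the rejection region when mu = mu0).

    lower: reject when the sample mean is below this critical value.
    upper: reject when the sample mean is above this critical value.
    """
    sampling = NormalDist(mu=mu0, sigma=std_error)
    alpha = 0.0
    if lower is not None:
        alpha += sampling.cdf(lower)       # area in the left tail
    if upper is not None:
        alpha += 1 - sampling.cdf(upper)   # area in the right tail
    return alpha


alpha_mail = significance_level(155, 5, lower=145, upper=165)  # two-tailed
alpha_tires = significance_level(60_000, 1_000, lower=58_000)  # lower-tailed
alpha_nfl = significance_level(200, 5, upper=215)              # upper-tailed
print(f"{alpha_mail:.4f}  {alpha_tires:.4f}  {alpha_nfl:.4f}")
```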


Thus far, we have considered how to find the level of significance when the rejection and 
nonrejection regions are specified. Often, the level of significance is specified and the rejection and 
nonrejection regions are required. Suppose in the mail order example that the level of significance is 
specified to be \(\alpha = .01\). Because the hypothesis test is two-tailed, the level of significance is divided 
into two halves equal to .005 each. We need to find the value a, where P(z > a) = .005 and P(z < -a) = 
.005. The probability P(z > a) = .005 implies that P(0 < z < a) = .5 - .005 = .495. Using the standard 
normal table, we see that P(0 < z < 2.57) = .4949 and P(0 < z < 2.58) = .4951. The interpolated value 
is a = 2.575, which we round to 2.58. The rejection region is therefore as follows: z < -2.58 or 
z > 2.58. Figure 9-4 shows the rejection regions as well as the area under the standard normal curve 
associated with the regions. 


rejection region                         rejection region 
Area = .005                              Area = .005 
            -2.58                  2.58 


Fig. 9-4 


To determine the rejection region in terms of \(\bar{x}\), the equation \(z = (\bar{x} - 155)/5\) may be used. The 
inequality z < -2.58 is replaced by \((\bar{x} - 155)/5 < -2.58\). Multiplying both sides of the inequality by 5 and 
then adding 155 to both sides, the inequality \((\bar{x} - 155)/5 < -2.58\) is seen to be equivalent to \(\bar{x} < 142.1\). 
The inequality z > 2.58 is replaced by \((\bar{x} - 155)/5 > 2.58\), which is equivalent to \(\bar{x} > 167.9\). Figure 9-5 
shows the rejection regions as well as the area under the normal curve associated with the sample 
means. 


rejection region                         rejection region 
Area = .005                              Area = .005 
            142.1                  167.9 


Fig. 9-5 


EXAMPLE 9.7 Suppose we wish to test the hypothesis in Example 9.5 at a level of significance equal to .01. 
Since this is a lower-tailed test, we need to find a standard normal value a such that P(z < a) = .01. From the 
standard normal distribution table, we find that P(0 < z < 2.33) = .4901. Therefore P(z > 2.33) = .01. Because of 
the symmetry of the standard normal curve, P(z < -2.33) also equals .01. Hence z < -2.33 is a rejection region 
of size \(\alpha = .01\). To find the rejection region in terms of the sample mean, we note from Example 9.5 that if the 
null hypothesis is true, then \(z = (\bar{x} - 60{,}000)/1{,}000\) has a standard normal distribution. Substituting for z in the 
inequality z < -2.33, we have \((\bar{x} - 60{,}000)/1{,}000 < -2.33\) or, solving for \(\bar{x}\), we find \(\bar{x} < 57{,}670\) miles as the rejection 
region. Figure 9-6 shows the rejection region as well as the area under the standard normal curve associated with 
the region. 


rejection region 
Area = .01 


Fig. 9-6 


Figure 9-7 shows the rejection region as well as the area under the normal curve associated with the sample 
means. 


rejection region 
Area = .01 
       57,670 


Fig. 9-7 


EXAMPLE 9.8 Suppose we wish to test the hypothesis in Example 9.6 at a level of significance equal to .10. 
Since this is an upper-tailed test, we need to find a standard normal value a such that P(z > a) = .10. From the 
standard normal distribution table, we find that P(0 < z < 1.28) = .3997. Therefore P(z > 1.28) = .10. Hence 
z > 1.28 is a rejection region of size \(\alpha = .10\). To find the rejection region in terms of the sample mean, we note 
from Example 9.6 that if the null hypothesis is true, then \(z = (\bar{x} - 200)/5\) has a standard normal distribution. 
Substituting for z in the inequality z > 1.28, we have \((\bar{x} - 200)/5 > 1.28\) or, solving for \(\bar{x}\), we find \(\bar{x} > \$206.40\) as 
the rejection region. Figure 9-8 shows the rejection region as well as the area under the standard normal curve 
associated with the region. 


rejection region 
Area = .10 


Fig. 9-8 


Figure 9-9 shows the rejection region as well as the area under the normal curve associated with the sample 
means. 


rejection region 
Area = .10 
       206.40 


Fig. 9-9 
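Going from a specified level of significance to critical values, as in the mail order test at \(\alpha = .01\) and in Examples 9.7 and 9.8, can be sketched in Python. The function name `rejection_region` and its `tail` argument are assumptions made for this illustration. It uses exact normal quantiles, so the results (about 142.12 and 167.88, 57,674, and 206.41) differ slightly from the values obtained above with the rounded table values 2.58, 2.33, and 1.28.

```python
from statistics import NormalDist


def rejection_region(mu0, std_error, alpha, tail):
    """Critical value(s) of the sample mean for a large-sample test.

    Returns (lower, upper); an entry is None when that tail is not used.
    """
    std_normal = NormalDist()
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
        return mu0 - z * std_error, mu0 + z * std_error
    if tail == "lower":
        z = std_normal.inv_cdf(alpha)          # negative z value
        return mu0 + z * std_error, None
    if tail == "upper":
        z = std_normal.inv_cdf(1 - alpha)
        return None, mu0 + z * std_error
    raise ValueError("tail must be 'two', 'lower', or 'upper'")


mail_low, mail_high = rejection_region(155, 5, 0.01, "two")
tire_low, _ = rejection_region(60_000, 1_000, 0.01, "lower")
_, nfl_high = rejection_region(200, 5, 0.10, "upper")
print(f"{mail_low:.2f} {mail_high:.2f} {tire_low:.0f} {nfl_high:.2f}")
```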


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: LARGE SAMPLES 


This section summarizes the material presented in the preceding sections. The techniques 
presented are appropriate when the sample size is 30 or more. 


EXAMPLE 9.9 The mean age of policyholders at World Life Insurance Company, determined two years ago, 
was found to equal 32.5 years and the standard deviation was found to equal 5.5 years. It is reasonable to 
believe that the mean age has increased. However, some of the older policyholders are now deceased and some 
younger policyholders have been added. The company determines the ages of 50 current policyholders in order 
to decide whether the mean age has changed. If \(\mu\) represents the current mean of all policyholders, the null and 
research hypotheses are stated as follows: 


\(H_0: \mu = 32.5\) (the mean age has not changed) 
\(H_a: \mu \neq 32.5\) (the mean age has changed) 


The test is performed at the conventional level of significance, which is \(\alpha = .05\). From the standard normal 
distribution, it is determined that P(z > 1.96) = .025 and therefore P(z < -1.96) = .025. The rejection and 
nonrejection regions are therefore as follows: 


rejection region | nonrejection region | rejection region      z 
              -1.96                   1.96 


The standard deviation of the policyholder ages is known to remain fairly constant, and therefore the standard 
error is \(\sigma_{\bar{x}} = 5.5/\sqrt{50} = .778\). The mean age of the sample of 50 policyholders is determined to be 34.4 years. The 
computed test statistic is \(z = (\bar{x} - \mu_0)/\sigma_{\bar{x}} = (34.4 - 32.5)/.778 = 2.44\). That is, assuming the mean age of all policyholders 
has not changed from two years ago, the mean of the current sample is 2.44 standard errors above the population 
mean. Since this exceeds 1.96, the null hypothesis is rejected and it is concluded that the current mean age 
exceeds 32.5 years. Notice that the four steps in Table 9.2 were followed in performing the test of hypothesis. 


Table 9.2 gives a set of steps that may be used to perform a test of hypothesis about the mean of 
a population. 


Table 9.2 
Steps for Testing a Hypothesis About a Population Mean: Large Sample 

Step 1: State the null and research hypotheses. The null hypothesis is represented symbolically by 
\(H_0: \mu = \mu_0\), and the research hypothesis is of the form \(H_a: \mu \neq \mu_0\) or \(H_a: \mu < \mu_0\) or \(H_a: \mu > \mu_0\). 


Step 2: Use the standard normal distribution table and the level of significance, \(\alpha\), to determine the 
rejection region. 


Step 3: Compute the value of the test statistic as follows: \(z = (\bar{x} - \mu_0)/\sigma_{\bar{x}}\), where \(\bar{x}\) is the mean of the 
sample, \(\mu_0\) is given in the null hypothesis, and \(\sigma_{\bar{x}}\) is computed by dividing \(\sigma\) by \(\sqrt{n}\). If \(\sigma\) is 
unknown, estimate it by using the sample standard deviation, s, when computing \(\sigma_{\bar{x}}\). 


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test 
statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. 
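The four steps can be sketched as a single Python function. This is one possible arrangement under stated assumptions, not a procedure given in the text: the name `large_sample_z_test` and the `alternative` labels are hypothetical, and the critical value comes from `statistics.NormalDist` instead of the printed table. The usage lines apply it to the policyholder data of Example 9.9.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_z_test(x_bar, mu0, sd, n, alpha, alternative):
    """Return (z, reject) for a large-sample test about a population mean.

    alternative: "two-sided" (mu != mu0), "less" (mu < mu0), "greater" (mu > mu0).
    sd: population standard deviation, or the sample value s when sigma is unknown.
    """
    std_normal = NormalDist()
    # Step 3: test statistic, using the standard error sd / sqrt(n).
    z = (x_bar - mu0) / (sd / sqrt(n))
    # Steps 2 and 4: locate the rejection region and compare z with it.
    if alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":
        reject = z < std_normal.inv_cdf(alpha)
    elif alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")
    return z, reject


# Example 9.9: H0: mu = 32.5 vs. Ha: mu != 32.5 at alpha = .05.
z_99, reject_99 = large_sample_z_test(34.4, 32.5, 5.5, 50, 0.05, "two-sided")
print(f"z = {z_99:.2f}, reject H0: {reject_99}")  # prints z = 2.44, reject H0: True
```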


EXAMPLE 9.10 The police department in a large city claims that the mean 911 response time for domestic 
disturbance calls is 10 minutes. A “watchdog group” believes that the mean response time is greater than 10 
minutes. If \(\mu\) represents the mean response time for all such calls, the watchdog group wishes to test the research 
hypothesis that \(\mu > 10\) at level of significance \(\alpha = .01\). The null and research hypotheses are: 


\(H_0: \mu = 10\) (the police department claim is correct) 
\(H_a: \mu > 10\) (the police department claim is not correct) 


From the standard normal distribution table, it is found that P(z > 2.33) = .01. The rejection and nonrejection 
regions are as follows: 


nonrejection region | rejection region      z 
                  2.33 


A sample of 35 response times for domestic disturbance calls is obtained and the mean response time is found to 
be 11.5 minutes and the standard deviation of the 35 response times is 6.0 minutes. The standard error is 
estimated to be \(s/\sqrt{n} = 6.0/\sqrt{35} = 1.01\) minutes. The computed test statistic is \(z = (\bar{x} - \mu_0)/\sigma_{\bar{x}} = (11.5 - 10.0)/1.01 = 1.49\). 
Since the computed test statistic does not exceed 2.33, the police department claim is not rejected. Note that the 
study has not proved the claim to be true. However, the results of the study are not strong enough to refute the 
claim at the 1% level of significance. 


EXAMPLE 9.11 A sociologist wishes to test the null hypothesis that the mean age of gang members in a large 
city is 14 years vs. the alternative that the mean is less than 14 years at level of significance \(\alpha = .10\). A sample of 
40 ages from the police gang unit records in the city found that \(\bar{x} = 13.1\) years and s = 2.5 years. The estimated 
standard error is \(s/\sqrt{n} = 2.5/\sqrt{40} = .40\) years. Assuming the null hypothesis to be true, the sample mean is 2.25 
standard errors below the population mean, since \(z = (13.1 - 14)/.40 = -2.25\). For \(\alpha = .10\), the rejection region is 
z < -1.28. Since the computed test statistic is less than -1.28, the null hypothesis is rejected and it is concluded 
that the mean age of the gang members is less than 14 years. 


The process of testing a statistical hypothesis is similar to the proceedings in a courtroom in the 
United States. An individual, charged with a crime, is assumed not guilty. However, the prosecution 
believes the individual is guilty and provides sample evidence to try and prove that the person is 
guilty. The null and alternative hypotheses may be stated as follows: 


\(H_0\): the individual charged with the crime is not guilty 
\(H_a\): the individual charged with the crime is guilty 


If the evidence is strong, the null hypothesis is rejected and the person is declared guilty. If the 
evidence is circumstantial and not strong enough, the person is declared not guilty. Notice that the 
person is not usually declared innocent, but is found not guilty. The evidence usually does not prove 
the null hypothesis to be true, but is not strong enough to reject it in favor of the alternative. 


CALCULATING TYPE II ERRORS 


In testing a statistical hypothesis, a Type II error occurs when the test results in not rejecting the 
null hypothesis when the research hypothesis is true. The probability of a Type II error is represented 
by the Greek letter \(\beta\) and is defined in formula (9.2): 


\(\beta\) = P(not rejecting the null hypothesis when the research hypothesis is true) (9.2) 


Consider once again the mail order company example discussed in previous sections. The null 
and research hypothesis are: 


\(H_0: \mu = \$155\) (the mean sales per order this year is $155) 
\(H_a: \mu \neq \$155\) (the mean sales per order this year is not $155) 


The rejection and nonrejection regions are: 


rejection region | nonrejection region | rejection region 
                145                   165 


The level of significance is \(\alpha = .0456\). The level of significance is computed under the assumption 
that the null hypothesis is true, that is, \(\mu = 155\). The probability of a Type II error is calculated under 
the assumption that the research hypothesis is true. However, the research hypothesis is true 
whenever \(\mu \neq \$155\). Suppose we wish to calculate \(\beta\) when \(\mu = 157.5\). The sequence of steps to 
compute \(\beta\) is as follows: 


\(\beta\) = P(not rejecting the null hypothesis when the research hypothesis is true) 
\(\beta = P(145 < \bar{x} < 165 \text{ when } \mu = 157.5)\) 


The event \(145 < \bar{x} < 165\) when \(\mu = 157.5\) is equivalent to 

\[\frac{145 - 157.5}{5} < \frac{\bar{x} - 157.5}{5} < \frac{165 - 157.5}{5}\] 

or -2.5 < z < 1.5. The calculation of \(\beta\) is therefore given as follows: 

\(\beta = P(-2.5 < z < 1.5) = .4332 + .4938 = .9270\) 

Suppose we wish to compute \(\beta\) when \(\mu = 160\). The sequence of steps to compute \(\beta\) is as follows: 

\(\beta\) = P(not rejecting the null hypothesis when the research hypothesis is true) 
\(\beta = P(145 < \bar{x} < 165 \text{ when } \mu = 160.0)\) 


The event \(145 < \bar{x} < 165\) when \(\mu = 160.0\) is equivalent to 

\[\frac{145 - 160}{5} < \frac{\bar{x} - 160}{5} < \frac{165 - 160}{5}\] 

or -3 < z < 1. The calculation of \(\beta\) is therefore given as follows: 

\(\beta = P(-3 < z < 1) = .4986 + .3413 = .8399\) 


Notice that the value of \(\beta\) is not constant, but depends on the alternative value assumed for \(\mu\). Table 
9.3 gives the values of \(\beta\) for several different values of \(\mu\). 


Table 9.3 


A plot of \(\beta\) vs. \(\mu\) is called an operating characteristic curve. An operating characteristic curve for 
Table 9.3 is shown in Fig. 9-10. 
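Points on an operating characteristic curve for the mail order test can be generated with a short Python sketch. The helper name `type_ii_error` is hypothetical, and the list of alternative means below is an assumed illustration rather than the set of values used in Table 9.3. With exact normal areas the two values worked out above come to about .9270 and .8400.

```python
from statistics import NormalDist


def type_ii_error(mu_true, std_error, lower, upper):
    """beta = P(lower < sample mean < upper) when the true mean is mu_true."""
    sampling = NormalDist(mu=mu_true, sigma=std_error)
    return sampling.cdf(upper) - sampling.cdf(lower)


# Nonrejection region 145 < x-bar < 165, standard error 5 (mail order example).
# The alternative means listed here are illustrative choices.
for mu in (145, 150, 152.5, 157.5, 160, 165):
    beta = type_ii_error(mu, std_error=5, lower=145, upper=165)
    print(f"mu = {mu:>5}   beta = {beta:.4f}")
```

The values of \(\beta\) shrink as the alternative mean moves farther from 155, which is the shape an operating characteristic curve displays.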


Fig. 9-10 (operating characteristic curve: \(\beta\) plotted against \(\mu\)) 


EXAMPLE 9.12 Suppose we wish to construct an operating characteristic curve for the hypothesis concerning 
the tire manufacturer’s claim discussed in Example 9.3. Recall that the null and research hypotheses were as 
follows: 

\(H_0: \mu = 60{,}000\) miles (the tire manufacturer’s claim is correct) 

\(H_a: \mu < 60{,}000\) miles (the tire manufacturer’s claim is false) 


The rejection and nonrejection regions were as follows: 


rejection region | nonrejection region      \(\bar{x}\) 
              58,000 


The level of significance is \(\alpha = .0228\). To illustrate the computation of \(\beta\), suppose we wish to compute the 
probability of committing a Type II error when \(\mu = 59{,}000\). 


\(\beta\) = P(not rejecting the null hypothesis when the research hypothesis is true) 
\(\beta = P(\bar{x} > 58{,}000 \text{ when } \mu = 59{,}000)\) 


The event \(\bar{x} > 58{,}000\) when \(\mu = 59{,}000\) is equivalent to \((\bar{x} - 59{,}000)/1{,}000 > (58{,}000 - 59{,}000)/1{,}000\) or z > -1. Therefore, 


\(\beta = P(z > -1) = .5000 + .3413 = .8413\). 


Table 9.4 gives the \(\beta\) values for several different \(\mu\) values. 


Table 9.4 


P VALUES 


Consider once again the mail order company example. The null and research hypotheses are: 


\(H_0: \mu = \$155\) (the mean sales per order this year is $155) 
\(H_a: \mu \neq \$155\) (the mean sales per order this year is not $155) 


For level of significance \(\alpha = .05\), the rejection region is shown in Fig. 9-12. 


rejection region | nonrejection region | rejection region      z 
              -1.96                   1.96 


Fig. 9-12 


To understand the concept of p value, consider the following four scenarios. Also recall that the 
standard error is \(\sigma_{\bar{x}} = 50/\sqrt{100} = 5\). 

Scenario 1: The mean of the 100 sampled accounts is found to equal $149. The computed test 
statistic is \(z = (149 - 155)/5 = -1.2\), and since the computed test statistic falls in the nonrejection region, 
the null hypothesis is not rejected. 


Scenario 2: The mean of the 100 sampled accounts is found to equal $164. The computed test 
statistic is z = 1.8, and since the computed test statistic falls in the nonrejection region, the null 
hypothesis is not rejected. 


Scenario 3: The mean of the 100 sampled accounts is found to equal $165. The computed test 
statistic is z = 2.0, and since the computed test statistic falls in the rejection region, the null 
hypothesis is rejected. 


Scenario 4: The mean of the 100 sampled accounts is found to equal $140. The computed test 
statistic is z = -3.0, and since the computed test statistic falls in the rejection region, the null 
hypothesis is rejected. 


Notice in scenarios 1 and 2 that even though the evidence is not strong enough to reject the null 
hypothesis, the test statistic is nearer the rejection region in scenario 2 than in scenario 1. Also notice in 
scenarios 3 and 4 that the evidence favoring rejection of the null hypothesis is stronger in scenario 4 
than in scenario 3, since the sample mean in scenario 4 is 3 standard errors away from the hypothesized 
population mean, and the sample mean is only 2 standard errors away in scenario 3. The p value is used to 
reflect these differences. The p value is defined to be the smallest level of significance at which the 
null hypothesis would be rejected. 

In scenario 1, the smallest level of significance at which the null hypothesis would be rejected 
would be one in which the rejection region would be z ≤ −1.2 or z ≥ 1.2. The level of significance 
corresponding to this rejection region is 2 × P(z ≥ 1.2) = 2 × (.5 − .3849) = .2302. The p value 
corresponding to the test statistic z = −1.2 is .2302. 

In scenario 2, the smallest level of significance at which the null hypothesis would be rejected 
would be one in which the rejection region would be z ≤ −1.8 or z ≥ 1.8. The level of significance 
corresponding to this rejection region is 2 × P(z ≥ 1.8) = 2 × (.5 − .4641) = .0718. The p value 
corresponding to the test statistic z = 1.8 is .0718. 

In scenario 3, the smallest level of significance at which the null hypothesis would be rejected 
would be one in which the rejection region would be z ≤ −2.0 or z ≥ 2.0. The level of significance 
corresponding to this rejection region is 2 × P(z ≥ 2.0) = 2 × (.5 − .4772) = .0456. The p value 
corresponding to the test statistic z = 2.0 is .0456. 

In scenario 4, the smallest level of significance at which the null hypothesis would be rejected 
would be one in which the rejection region would be z ≤ −3.0 or z ≥ 3.0. The level of significance 
corresponding to this rejection region is 2 × P(z ≥ 3.0) = 2 × (.5 − .4987) = .0026. The p value 
corresponding to the test statistic z = −3.0 is .0026. 

To summarize the procedure for computing the p value for a two-tailed test, suppose z* 
represents the computed test statistic when testing H0: μ = μ0 vs. Ha: μ ≠ μ0. The p value is given by 
formula (9.3): 


p value = P(|z| > |z*|) (9.3) 


When using the p value approach to testing hypotheses, the null hypothesis is rejected if the p value 
is less than α. In addition, the p value gives an idea about the strength of the evidence for rejecting 
the null hypothesis. The smaller the p value, the stronger the evidence for rejecting the null 
hypothesis. 
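
The four scenarios can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes Python's standard library class statistics.NormalDist in place of the standard normal table, and the names two_tailed_p_value, MU_0, STD_ERROR, and ALPHA are our own.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_tailed_p_value(z_star):
    """Formula (9.3): P(|z| > |z*|) for a standard normal z."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z_star)))


MU_0 = 155.0      # hypothesized mean sales per order
STD_ERROR = 5.0   # standard error quoted in the text
ALPHA = 0.05      # level of significance

# Sample means for scenarios 1 through 4.
scenario_means = {1: 149.0, 2: 164.0, 3: 165.0, 4: 140.0}

p_values = {}
for scenario, sample_mean in scenario_means.items():
    z_star = (sample_mean - MU_0) / STD_ERROR
    p_values[scenario] = two_tailed_p_value(z_star)
    decision = "reject H0" if p_values[scenario] < ALPHA else "do not reject H0"
    print(f"Scenario {scenario}: z* = {z_star:+.1f}, "
          f"p value = {p_values[scenario]:.4f}, {decision}")
```

Because the code does not round to four-place table entries, the printed p values may differ from the table-based values in the last decimal place. The decisions agree with those reached using the rejection region.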


EXAMPLE 9.13 In Example 9.10, the following statistical hypothesis was tested for α = .01: 


H0: μ = 10 (the police department claim is correct) 
Ha: μ > 10 (the police department claim is not correct) 


The computed test statistic is z* = 1.49. The smallest level of significance at which the null hypothesis would be 
rejected is one in which the rejection region is z > 1.49. The p value is therefore equal to P(z > 1.49). From the 
standard normal distribution table, we find the p value = .5 − .4319 = .0681. Using the p value approach to 
testing, the null hypothesis is not rejected since the p value > α. 


To summarize the procedure for computing the p value for an upper-tailed test, suppose z* 
represents the computed test statistic when testing H0: μ = μ0 vs. Ha: μ > μ0. The p value is given by 


p value = P(z > z*) (9.4) 


EXAMPLE 9.14 In Example 9.11, the following statistical hypothesis was tested for α = .10. 


H0: μ = 14 (the mean age of gang members in the city is 14) 
Ha: μ < 14 (the mean age of gang members in the city is less than 14) 


The computed test statistic is z* = −2.25. The smallest level of significance at which the null hypothesis would 
be rejected is one in which the rejection region is z < −2.25. The p value is therefore equal to P(z < −2.25). 
Using the standard normal distribution table, we find the p value = .5 − .4878 = .0122. Using the p value 
approach to testing, the null hypothesis is rejected since the p value < .10. 


To summarize the procedure for computing the p value for a lower-tailed test, suppose z* represents 
the computed test statistic when testing H0: μ = μ0 vs. Ha: μ < μ0. The p value is given by 


p value = P(z < z*) (9.5) 
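

Formulas (9.3), (9.4), and (9.5) can be collected into a single routine. The following is a minimal sketch, assuming Python's statistics.NormalDist; the function name z_test_p_value and the tail labels 'two', 'upper', and 'lower' are hypothetical choices made for this illustration.

```python
from statistics import NormalDist


def z_test_p_value(z_star, tail):
    """p value for a z test; `tail` is 'two', 'upper', or 'lower'."""
    z = NormalDist()  # standard normal distribution
    if tail == "two":
        return 2.0 * (1.0 - z.cdf(abs(z_star)))  # formula (9.3)
    if tail == "upper":
        return 1.0 - z.cdf(z_star)               # formula (9.4)
    if tail == "lower":
        return z.cdf(z_star)                     # formula (9.5)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")


# Example 9.13: upper-tailed test, z* = 1.49, alpha = .01
p_913 = z_test_p_value(1.49, "upper")
# Example 9.14: lower-tailed test, z* = -2.25, alpha = .10
p_914 = z_test_p_value(-2.25, "lower")

print(f"Example 9.13: p value = {p_913:.4f}, reject H0: {p_913 < 0.01}")
print(f"Example 9.14: p value = {p_914:.4f}, reject H0: {p_914 < 0.10}")
```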
EXAMPLE 9.15 A random sample of 40 community college tuition costs was selected to test the null 
hypothesis that the mean cost for all community colleges equals $1500 vs. the alternative that the mean does not 
equal $1500. The level of significance is .05. The data are shown in Table 9.5. 

The Minitab analysis for this test of hypothesis is shown below. The command set c1 is used to put the data in 
column 1. The name cost is given to column 1. The command standard deviation c1 is used to find s. The 
command ztest mean = 1500 sigma = 655.44 data in c1; is used to specify the value of μ0, the estimate of 
sigma, and the column where the data are located. The subcommand alternative 0. indicates a two-tailed test. 
Since the p value is .32 and exceeds .05, the null hypothesis is not rejected. 


MTB > set c1 
DATA > 1200 850 1700 1500 700 1200 1500 2000 1950 750 
DATA > 850 3000 2100 500 500 1950 1000 950 560 500 
DATA > 1750 1650 900 2050 1780 675 1080 680 900 1500 
DATA > 930 1640 1320 1750 2500 2310 2900 1875 1450 950 
DATA > end 
MTB > name c1 'cost' 
MTB > standard deviation c1 

Column Standard Deviation 

Standard deviation of cost = 655.44 
MTB > ztest mean = 1500 sigma = 655.44 data in c1; 
SUBC > alternative 0. 


Z-Test 
Test of mu = 1500 vs mu not = 1500 
The assumed sigma = 655 


Variable   N    Mean   StDev   SE Mean   Z       P 
cost       40   1396   655     104       -1.00   .32 


To test the research hypothesis that μ < 1500, the subcommand alternative -1. is used. The p value is now 
equal to .16. 


MTB > ztest mean = 1500 sigma = 655.44 data in c1; 
SUBC > alternative -1. 


Z-Test 
Test of mu = 1500 vs mu < 1500 
The assumed sigma = 655 


Variable   N    Mean   StDev   SE Mean   Z       P 
cost       40   1396   655     104       -1.00   .16 


To test the research hypothesis that μ > 1500, the subcommand alternative 1. is used. The p value is now equal 
to .84. 


MTB > ztest mean = 1500 sigma = 655.44 data in c1; 
SUBC > alternative 1. 

Z-Test 
Test of mu = 1500 vs mu > 1500 
The assumed sigma = 655 


Variable   N    Mean   StDev   SE Mean   Z       P 
cost       40   1396   655     104       -1.00   .84 
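

For readers without access to Minitab, the three analyses above can be approximated as follows. This is a sketch only, assuming Python's standard library; it takes sigma = 655.44 from the Minitab session rather than recomputing it, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist, mean

# The 40 tuition costs entered in the Minitab session.
costs = [
    1200, 850, 1700, 1500, 700, 1200, 1500, 2000, 1950, 750,
    850, 3000, 2100, 500, 500, 1950, 1000, 950, 560, 500,
    1750, 1650, 900, 2050, 1780, 675, 1080, 680, 900, 1500,
    930, 1640, 1320, 1750, 2500, 2310, 2900, 1875, 1450, 950,
]

MU_0 = 1500.0   # hypothesized mean
SIGMA = 655.44  # sample standard deviation from the session, used as sigma

n = len(costs)
sample_mean = mean(costs)
std_error = SIGMA / sqrt(n)
z_star = (sample_mean - MU_0) / std_error

z = NormalDist()
p_two_tailed = 2.0 * (1.0 - z.cdf(abs(z_star)))  # Ha: mu not equal to 1500
p_lower = z.cdf(z_star)                          # Ha: mu < 1500
p_upper = 1.0 - z.cdf(z_star)                    # Ha: mu > 1500

print(f"n = {n}, mean = {sample_mean:.2f}, SE mean = {std_error:.1f}, z* = {z_star:.2f}")
print(f"p values: two-tailed {p_two_tailed:.2f}, lower {p_lower:.2f}, upper {p_upper:.2f}")
```

Note that the lower-tailed and upper-tailed p values sum to 1, and the two-tailed p value is twice the smaller of the two.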


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: SMALL SAMPLES 


The procedure given in Table 9.2, for testing statistical hypotheses about the population mean 
when the sample size is large (n ≥ 30), is valid for all types of population distributions. The 
procedures given in this section when the sample size is small (n < 30) require the assumption that 
the population has a normal distribution. If the population standard deviation is known, the standard 
normal distribution table is used to determine the rejection and nonrejection regions. If the 
population standard deviation is unknown, the t distribution table is used to determine the rejection 
and nonrejection regions. Since the population standard deviation is usually unknown in practice, we 
shall discuss this case only. Table 9.6 summarizes the procedure for small samples from a normally 
distributed population with unknown σ. 


Table 9.6 


Steps for Testing a Hypothesis Concerning a Population Mean: Small Sample 
Normally Distributed Population with Unknown σ 


Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by H0: μ = μ0, 
and the research hypothesis is of the form Ha: μ ≠ μ0 or Ha: μ < μ0 or Ha: μ > μ0. 


Step 2: Use the t distribution table, with degrees of freedom = n − 1 and the level of significance, α, to 
determine the rejection region. 


Step 3: Compute the value of the test statistic as follows: t = (x̄ − μ0)/s_x̄, where x̄ is the mean of the 
sample, μ0 is given in the null hypothesis, and s_x̄ is computed by dividing s by √n. 


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls 
in the rejection region. Otherwise, the null hypothesis is not rejected. 


EXAMPLE 9.16 In order to test the hypothesis that the mean age of all commercial airplanes is 10 years vs. 
the alternative that the mean is not 10 years, a sample of 25 airplanes is selected from several airlines. A 
histogram of the 25 ages is mound-shaped and it is therefore assumed that the distribution of all airplane ages is 
normally distributed. The sample mean is found to equal 11.5 years and the sample standard deviation is equal 
to 5.5 years. The level of significance is chosen to be α = .05. From the t distribution tables with df = 24, it is 
determined that the area to the right of 2.064 is .025. By symmetry, the area to the left of -2.064 is also .025. 
The rejection regions as well as the corresponding areas under the t distribution curve are shown in Fig. 9-13. 


[Fig. 9-13: t distribution curve with df = 24; rejection regions t < −2.064 and t > 2.064, each with area .025] 
The estimated standard error is s_x̄ = s/√n = 5.5/√25 = 1.1 years. The computed test statistic is 
t = (x̄ − μ0)/s_x̄ = (11.5 − 10)/1.1 = 1.36, and since this value does not fall in the rejection region, the 
null hypothesis is not rejected. 
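

Steps 3 and 4 of Table 9.6 can be sketched in code for this example. The snippet below assumes only Python's standard library, which has no t distribution, so the critical value 2.064 is taken from the t table as quoted in the example; the function name t_statistic is our own.

```python
from math import sqrt


def t_statistic(sample_mean, mu_0, sample_sd, n):
    """Step 3 of Table 9.6: t = (x-bar - mu_0) / (s / sqrt(n))."""
    estimated_std_error = sample_sd / sqrt(n)
    return (sample_mean - mu_0) / estimated_std_error


# Critical value from the t table quoted in the example:
# df = 24, area .025 in each tail (two-tailed test at alpha = .05).
T_CRITICAL = 2.064

t_star = t_statistic(sample_mean=11.5, mu_0=10.0, sample_sd=5.5, n=25)
reject_h0 = abs(t_star) > T_CRITICAL  # Step 4: two-tailed decision rule

print(f"t* = {t_star:.2f}, reject H0: {reject_h0}")
```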


EXAMPLE 9.17 A social worker wishes to test the hypothesis that the mean monthly household payment for 
food stamps in Shelby county is $225 vs. the alternative that it is greater than that amount. The payments for 20 
randomly selected households are given in Table 9.7. 


Table 9.7 


A histogram of the payments is shown in Fig. 9-14. The histogram indicates that it is reasonable to assume that 
the payments are normally distributed. 


[Fig. 9-14: histogram of payment (horizontal axis, 190 to 310) against frequency (vertical axis)] 


The Minitab analysis for the data is shown below. The command ttest mean = 225 data in c1; gives the value 
for μ0 and the location of the sample data in column 1. The subcommand alternative = 1. identifies the 
alternative as being upper-tailed. The p value indicates that the results are significant for any level of 
significance greater than .0023. 


MTB > set c1 
DATA > 300 250 200 225 275 280 220 210 290 310 
DATA > 190 255 245 265 235 250 272 228 213 277 
DATA > end 
MTB > ttest mean = 225 data in c1; 
SUBC > alternative = 1. 

T-Test of the Mean 
Test of mu = 225.00 vs mu > 225.00 


Variable   N    Mean     StDev   SE Mean   T      P 
Payment    20   249.50   34.04   7.61      3.22   0.0023 
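

The summary figures in the Minitab output can be recomputed directly from the 20 payments. This is a minimal sketch using Python's standard library; it reproduces the mean, standard deviation, standard error, and t statistic, but not the p value, since the standard library does not supply a t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# The 20 monthly food stamp payments entered in the Minitab session.
payments = [
    300, 250, 200, 225, 275, 280, 220, 210, 290, 310,
    190, 255, 245, 265, 235, 250, 272, 228, 213, 277,
]
MU_0 = 225.0  # hypothesized mean monthly payment

n = len(payments)
sample_mean = mean(payments)
sample_sd = stdev(payments)       # sample standard deviation (divisor n - 1)
std_error = sample_sd / sqrt(n)   # estimated standard error of the mean
t_star = (sample_mean - MU_0) / std_error

print(f"N = {n}, Mean = {sample_mean:.2f}, StDev = {sample_sd:.2f}, "
      f"SE Mean = {std_error:.2f}, T = {t_star:.2f}")
```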


HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION: 
LARGE SAMPLES 


In Chapter 7 the sampling distribution of p̄, the sample proportion, was discussed. When the 
sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the sample 
proportion is approximately normally distributed with mean equal to p and standard error σ_p̄ = √(pq/n). 
This result is sometimes referred to as the central limit theorem for the sample proportion. Furthermore, 
it was shown that z = (p̄ − p)/σ_p̄ has a standard normal distribution. These results form the theoretical 
underpinnings for tests of hypothesis concerning p, the population proportion. Table 9.8 gives the 
steps that may be followed when testing a hypothesis about the population proportion. 


Table 9.8 


Steps for Testing a Hypothesis Concerning a Population Proportion: 
Large Samples (np > 5 and nq > 5) 


Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by H0: p = p0, 
and the research hypothesis is of the form Ha: p ≠ p0 or Ha: p < p0 or Ha: p > p0. 


Step 2: Use the standard normal distribution table to determine the rejection and nonrejection regions. 


Step 3: Compute the value of the test statistic as follows: z = (p̄ − p0)/σ_p̄, where p̄ is the sample 
proportion, p0 is given in the null hypothesis, and σ_p̄ = √(p0 q0/n) with q0 = 1 − p0. 


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls 
in the rejection region. Otherwise, the null hypothesis is not rejected. 


EXAMPLE 9.18 Health-care coverage for employees varies with company size. It is reported that 30% of all 
companies with fewer than 10 employees provide health benefits for their employees. A sample of 50 companies 
with fewer than 10 employees is selected to test H0: p = .3 vs. Ha: p ≠ .3 at α = .01. It is found that 19 of the 50 
companies surveyed provide health benefits for their employees. Using the standard normal table, it is found 
that the area under the curve to the right of 2.58 is .005 and by symmetry, the area to the left of -2.58 is also 
.005. The rejection regions as well as the associated areas under the standard normal curve are shown in Fig. 
9-15. 


[Fig. 9-15: standard normal curve; rejection regions z < −2.58 and z > 2.58, each with area .005] 
Assuming the null hypothesis to be true, the standard error of the proportion is 
σ_p̄ = √(p0 q0/n) = √(.3 × .7/50) = .065. The sample proportion is p̄ = 19/50 = .38. The computed value of the 
test statistic is z* = (.38 − .30)/.065 = 1.23. 


Since the computed value of the test statistic is not in the rejection region, the null hypothesis is not rejected. 
The p value is P(|z| > 1.23) = P(z < −1.23) + P(z > 1.23) = 2 × P(z > 1.23) = 2 × (.5 − .3907) = .2186. 
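

The steps of Table 9.8 for a two-tailed test can be sketched as a short routine and applied to this example. The function name proportion_z_test is hypothetical, and Python's statistics.NormalDist is assumed in place of the standard normal table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p_0):
    """Two-tailed large-sample z test for a proportion.

    Returns (z*, p value). The standard error is computed under H0.
    """
    if n * p_0 <= 5 or n * (1 - p_0) <= 5:
        raise ValueError("large-sample conditions np > 5 and nq > 5 are not met")
    p_bar = successes / n                  # sample proportion
    std_error = sqrt(p_0 * (1 - p_0) / n)  # sqrt(p0 * q0 / n)
    z_star = (p_bar - p_0) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_star)))
    return z_star, p_value


# Example 9.18: 19 of 50 companies provide benefits, H0: p = .3, alpha = .01.
z_star, p_value = proportion_z_test(successes=19, n=50, p_0=0.3)
print(f"z* = {z_star:.2f}, p value = {p_value:.4f}, reject H0: {p_value < 0.01}")
```

The code carries the unrounded standard error through the calculation, so its p value differs slightly in the third decimal place from the table-based value, which rounds the standard error to .065 and z* to 1.23. The conclusion is unchanged.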
EXAMPLE 9.19 A national survey asked the following question of 2500 registered voters: “Is the character of 
a candidate for president important to you when deciding for whom to vote?” Two thousand of the responses 
were yes. Let p represent the percent of all registered voters who believe the character of the president is 
important when deciding for whom to vote. The results of the survey were used to test the null hypothesis H0: 
p = 90% vs. Ha: p < 90% at level of significance α = .05. The sample percent is p̄ = (2000/2500) × 100% = 80%. 
Assuming the null hypothesis to be true, the standard error is σ_p̄ = √(p0 q0/n) = √(90 × 10/2500) = .60%. The 
computed test statistic is z* = (80 − 90)/.60 = −16.7. The null hypothesis would be rejected at all practical 
levels of significance. 


Solved Problems 


NULL HYPOTHESIS AND ALTERNATIVE HYPOTHESIS 


9.1 A study in 1992 established the mean commuting distance for workers in a certain city to be 15 
miles. Because of the westward spread of the city, it is hypothesized that the current mean 
commuting distance exceeds 15 miles. A traffic engineer wishes to test the hypothesis that the 
mean commuting distance for workers in this city is greater than 15 miles. Give the null and 
alternative hypotheses for this scenario. 


Ans. H0: μ = 15, Ha: μ > 15, where μ represents the current mean commuting distance in this city. 


9.2 The mean number of sick days used per year nationally is reported to be 5.5 days. A study is 
undertaken to determine if the mean number of sick days used for nonunion members in 
Kansas differs from the national mean. Give the null and alternative hypothesis for this 
scenario. 


Ans. H0: μ = 5.5, Ha: μ ≠ 5.5, where μ represents the mean number of sick days used for nonunion 
members in Kansas. 


TEST STATISTIC, CRITICAL VALUES, REJECTION 
AND NONREJECTION REGIONS 


9.3. Refer to problem 9.1. The decision is made to reject the null hypothesis if the mean of a 
sample of 49 randomly selected commuting distances is more than 2.5 standard errors above 
the mean established in 1992. The standard deviation of the sample is found to equal 3.5 miles. 
Give the rejection and nonrejection regions. 


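Ans. The estimated standard error of the mean is s_x̄ = s/√n = 3.5/√49 = .5 miles. The null hypothesis is to 
be rejected if the sample mean of the 49 commuting distances exceeds the 1992 population mean 
by 2.5 standard errors. That is, reject H0 if x̄ > 15 + 2.5 × .5 = 16.25. 


[Figure: nonrejection region to the left of x̄ = 16.25; rejection region to the right] 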
9.4 Refer to problem 9.2. The null hypothesis is to be rejected if the mean of a sample of 250 
nonunion members differs from the national mean by more than 2 standard errors. The 
standard deviation of the sample is found to equal 2.1 days. Give the rejection and nonrejection 
regions. 


Ans. The estimated standard error of the mean is s_x̄ = s/√n = 2.1/√250 = .13 days. The null hypothesis is 
to be rejected if the mean of the sample differs from the national mean by more than 2 standard 
errors. That is, reject H0 if x̄ < 5.5 − 2 × .13 = 5.24 or if x̄ > 5.5 + 2 × .13 = 5.76. 
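

The cutoffs in problems 9.3 and 9.4 follow one pattern: the hypothesized mean plus or minus a stated number of standard errors. The sketch below illustrates this; the function name rejection_cutoffs and its std_error_decimals parameter are our own, and the standard error is rounded to two decimals only to match the hand calculations above.

```python
from math import sqrt


def rejection_cutoffs(mu_0, sample_sd, n, k, std_error_decimals=2):
    """Return (lower, upper) cutoffs for the sample mean: mu_0 -/+ k standard errors.

    The estimated standard error s / sqrt(n) is rounded to `std_error_decimals`
    places, mirroring the rounding used in the worked answers.
    """
    std_error = round(sample_sd / sqrt(n), std_error_decimals)
    return mu_0 - k * std_error, mu_0 + k * std_error


# Problem 9.3 (upper-tailed): only the upper cutoff is used.
_, upper_93 = rejection_cutoffs(mu_0=15.0, sample_sd=3.5, n=49, k=2.5)

# Problem 9.4 (two-tailed): both cutoffs are used.
lower_94, upper_94 = rejection_cutoffs(mu_0=5.5, sample_sd=2.1, n=250, k=2.0)

print(f"9.3: reject H0 if sample mean > {upper_93:.2f}")
print(f"9.4: reject H0 if sample mean < {lower_94:.2f} or > {upper_94:.2f}")
```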


TYPE I AND TYPE II ERRORS 


9.5 Refer to problems 9.1 and 9.2. Describe, in words, how a Type I and a Type II error for both 
problems might occur. 


Ans. In problem 9.1, a Type I error is made if it is concluded that the current mean commuting distance 
exceeds 15 miles when in fact the current mean is equal to 15 miles. A Type II error is made if it is 
concluded that the current mean commuting distance is 15 miles when in fact the mean exceeds 15 
miles. 


In problem 9.2, a Type I error is made if it is concluded that the mean number of sick days used 
for the nonunion members in Kansas is different from the national mean when in fact it is not 
different. A Type II error is made if it is concluded that the mean number of sick days used for 
nonunion members in Kansas is equal to the national mean when in fact it differs from the national 
mean. 


9.6 Refer to problems 9.3 and 9.4. Find the level of significance for the test procedures described 
in both problems. 


Ans. In problem 9.3, α = P(rejecting the null hypothesis when the null hypothesis is true) or α = P(x̄ > 
16.25 when μ = 15) = P(z > 2.5), since 16.25 is 2.5 standard errors above μ = 15. α = P(z > 2.5) = 
.5 − .4938 = .0062. 


In problem 9.4, α = P(rejecting the null hypothesis when the null hypothesis is true) or α = P(x̄ < 
5.24 or x̄ > 5.76 when μ = 5.5) = P(z < −2 or z > 2), since 5.24 is 2 standard errors below μ = 5.5 
and 5.76 is 2 standard errors above μ = 5.5. α = P(z < −2) + P(z > 2) = 2 × (.5 − .4772) = .0456. 
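

These two levels of significance can be checked without a table. The lines below are a sketch assuming Python's statistics.NormalDist; the variable names are ours.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Problem 9.3: reject H0 when the sample mean is more than 2.5 standard errors above 15.
alpha_93 = 1.0 - z.cdf(2.5)

# Problem 9.4: reject H0 when the sample mean is more than 2 standard errors from 5.5.
alpha_94 = z.cdf(-2.0) + (1.0 - z.cdf(2.0))

print(f"alpha (problem 9.3) = {alpha_93:.4f}")
print(f"alpha (problem 9.4) = {alpha_94:.4f}")
```

The second value may differ from .0456 in the last decimal place because the table entry .4772 is itself rounded.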


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: LARGE SAMPLES 


9.7 A sample of size 50 is used to test the following hypothesis: H0: μ = 17.5 vs. Ha: μ ≠ 17.5 and 
the level of significance is .01. Give the critical values, the rejection and nonrejection regions, 
the computed test statistic, and your conclusion if: (a) x̄ = 21.5 and s = 5.5 (σ unknown), (b) 
x̄ = 21.5 and σ = 5.0. 


Ans. The critical values and therefore the rejection and nonrejection regions are the same in both parts. 
Since P(z < −2.58) = P(z > 2.58) = .005, the critical values are −2.58 and 2.58. The rejection and 
nonrejection regions are as follows: 


[Figure: rejection regions z < −2.58 and z > 2.58; nonrejection region between −2.58 and 2.58] 
In part (a) the standard error is estimated by s_x̄ = s/√n = 5.5/√50 = .78, since the population standard 
deviation is unknown. The computed test statistic is z* = (21.5 − 17.5)/.78 = 5.13, and the null hypothesis 
is rejected since this value falls in the rejection region. 

In part (b) the standard error is computed by σ_x̄ = σ/√n = 5/√50 = .71, since the population standard 
deviation is known. The computed test statistic is z* = (21.5 − 17.5)/.71 = 5.63, and the null hypothesis is 
rejected since this value falls in the rejection region. 


9.8 A state environmental study concerning the number of scrap-tires accumulated per tire 
dealership during the past year was conducted. The null hypothesis is H0: μ = 2500 and the 
research hypothesis is Ha: μ ≠ 2500, where μ represents the mean number of scrap-tires per 
dealership in the state. For a random sample of 85 dealerships, the mean is 2750 and the 
standard deviation is 950. Conduct the hypothesis test at the 5% level of significance. 


Ans. The null hypothesis is rejected if |z*| > 1.96, where z* is the computed value of the test statistic. 
The estimated standard error is s_x̄ = s/√n = 950/√85 = 103 tires. The computed value of the test 
statistic is z* = (2750 − 2500)/103 = 2.43. Since 2.43 exceeds 1.96, the null hypothesis is rejected. 
It is concluded that the mean number of scrap-tires per dealership exceeded 2500 last year. 


CALCULATING TYPE II ERRORS 


9.9 In problems 9.1 and 9.3, the hypothesis system H0: μ = 15 and Ha: μ > 15 was tested by using a 
sample of size 49. The null hypothesis was to be rejected if the sample mean exceeded 16.25. 
Suppose the true value of μ is 16.5. What is the probability that this test procedure will not 
result in the rejection of the null hypothesis? 


Ans. β = P(not rejecting the null hypothesis when μ = 16.5) = P(x̄ < 16.25 when μ = 16.5). In problem 
9.3, the estimated standard error was found to equal .5. The event x̄ < 16.25 is equivalent to the 
event z = (x̄ − 16.5)/.5 < (16.25 − 16.5)/.5 = −.5 when μ = 16.5. Therefore, β = P(z < −.5) = P(z > .5) = 
.5 − .1915 = .3085. That is, there is a 30.85% chance that, even though the mean commuting distance 
has increased from the 15 miles figure of 1992 to the current figure of 16.5 miles, the hypothesis 
test will result in concluding that the current mean is the same as the 1992 mean. 


9.10 In problems 9.2 and 9.4, the hypothesis system H0: μ = 5.5 and Ha: μ ≠ 5.5 was tested by using 
a sample of size 250. The null hypothesis was to be rejected if the sample mean was less than 
5.24 or exceeded 5.76. Suppose the true value of μ is 5.0. What is the probability that this test 
procedure will not result in the rejection of the null hypothesis? 


Ans. β = P(not rejecting the null hypothesis when μ = 5.0) = P(5.24 < x̄ < 5.76 when μ is 5.0). In 
problem 9.4, the estimated standard error was found to equal .13. The event 5.24 < x̄ < 5.76 when 
μ is 5.0 is equivalent to the event (5.24 − 5.0)/.13 < (x̄ − 5.0)/.13 < (5.76 − 5.0)/.13, which is the same as 
the event 1.85 < z < 5.85. Therefore β = P(1.85 < z < 5.85). Since there is practically zero probability to the 
right of 5.85, β = P(z > 1.85) = .5 − .4678 = .0322. 
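

Both Type II error probabilities are the probability that the sample mean lands in the nonrejection region when the true mean differs from the hypothesized one. A minimal sketch follows, assuming Python's statistics.NormalDist; the function name beta and its keyword parameters are hypothetical, and a missing cutoff (None) stands for an unbounded side of the nonrejection region.

```python
from statistics import NormalDist


def beta(true_mean, std_error, lower_cutoff=None, upper_cutoff=None):
    """P(lower_cutoff < sample mean < upper_cutoff) when the mean is `true_mean`.

    A cutoff of None means the nonrejection region is unbounded on that side.
    """
    sampling_dist = NormalDist(mu=true_mean, sigma=std_error)
    lower_prob = 0.0 if lower_cutoff is None else sampling_dist.cdf(lower_cutoff)
    upper_prob = 1.0 if upper_cutoff is None else sampling_dist.cdf(upper_cutoff)
    return upper_prob - lower_prob


# Problem 9.9: nonrejection region is sample mean < 16.25, true mean 16.5.
beta_99 = beta(true_mean=16.5, std_error=0.5, upper_cutoff=16.25)

# Problem 9.10: nonrejection region is 5.24 < sample mean < 5.76, true mean 5.0.
beta_910 = beta(true_mean=5.0, std_error=0.13, lower_cutoff=5.24, upper_cutoff=5.76)

print(f"beta (problem 9.9)  = {beta_99:.4f}")
print(f"beta (problem 9.10) = {beta_910:.4f}")
```

The value for problem 9.10 differs slightly from .0322 because the code does not round the z values to two decimals before looking up the probabilities.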


P VALUES 


9.11 Use the p value approach to test the hypothesis in problem 9.8. 


Ans. The computed value of the test statistic is z* = 2.43. Since the hypothesis is two-tailed, the p value 
is given by p value = P(|z| > |z*|) or equivalently P(z < −|z*|) + P(z > |z*|) = P(z < −2.43) + 
P(z > 2.43). Since these two probabilities are equal, we have the p value = 2 × P(z > 2.43) = 2 × 
(.5 − .4925) = .015. Using the p value approach, the null hypothesis is rejected since the p value 
< α. 


9.12 Find the p values for the following hypothesis systems and z* values. 

(a) H0: μ = 22.5 vs. Ha: μ ≠ 22.5 and z* = 1.76 

(b) H0: μ = 37.1 vs. Ha: μ ≠ 37.1 and z* = −2.38 

(c) H0: μ = 12.5 vs. Ha: μ < 12.5 and z* = −.98 

(d) H0: μ = 22.5 vs. Ha: μ < 22.5 and z* = 0.35 

(e) H0: μ = 100 vs. Ha: μ > 100 and z* = 2.76 

(f) H0: μ = 72.5 vs. Ha: μ > 72.5 and z* = −0.25 

Ans. (a) The p value is given by p value = P(|z| > |z*|) = P(|z| > 1.76) = 2 × P(z > 1.76) = 2 × 
(.5 − .4608) = .0784. 
(b) The p value is given by p value = P(|z| > |z*|) = P(|z| > 2.38), since the absolute value of 
−2.38 is 2.38. Therefore, p value = P(|z| > 2.38) = 2 × P(z > 2.38) = 2 × (.5 − .4913) = .0174. 
(c) The p value is given by p value = P(z < z*) = P(z < −.98) = P(z > .98) = (.5 − .3365) = .1635. 
(d) The p value is given by p value = P(z < z*) = P(z < .35) = .5 + .1368 = .6368. 
(e) The p value is given by p value = P(z > z*) = P(z > 2.76) = .5 − .4971 = .0029. 
(f) The p value is given by p value = P(z > z*) = P(z > −.25) = .5 + .0987 = .5987. 


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: SMALL SAMPLES 


9.13 A psychological test, used to measure the level of hostility, is known to produce scores which 
are normally distributed, with mean = 35 and standard deviation = 5. The test is used to test the 
null hypothesis that μ = 35 vs. the research hypothesis that μ > 35, where μ represents the 
mean score for criminal defense lawyers. Sixteen criminal defense lawyers are administered 
the test and the sample mean is found to equal 39.5. The sample standard deviation is found to 
equal 10. The level of significance is selected to be α = .01. 

(a) Perform the test assuming that σ = 5. This amounts to assuming that the standard deviation 
of test scores for criminal defense lawyers is the same as the general population. 

(b) Perform the test assuming σ is unknown. 


Ans. (a) Since the test scores are normally distributed and σ is known, the test statistic, 
z = (x̄ − 35)/σ_x̄, is normally distributed. The rejection region is found using the standard normal 
distribution tables. The null hypothesis is rejected if the test statistic exceeds 2.33. The calculated 
value of the test statistic is z* = (39.5 − 35)/1.25 = 3.6, where the standard error is computed as 
σ_x̄ = σ/√n = 5/√16 = 1.25. The null hypothesis is rejected since 3.6 falls in the rejection region. 

(b) Because σ is unknown and the sample size is small, the test statistic, t = (x̄ − 35)/s_x̄, has a t 
distribution with df = 16 − 1 = 15. The rejection region is found by using the t distribution 
with df = 15. The null hypothesis is rejected if the test statistic exceeds 2.602. The calculated 
value of the test statistic is t* = (39.5 − 35)/2.5 = 1.80, where the estimated standard error is 
s_x̄ = s/√n = 10/√16 = 2.5. The null hypothesis is not rejected since 1.80 does not 
fall in the rejection region. 
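

The contrast between parts (a) and (b) can be seen side by side in a short calculation. The sketch below uses only Python's standard library; the critical values 2.33 and 2.602 are the table values quoted in the answer, and the function name standardized_statistic is our own.

```python
from math import sqrt

N = 16              # number of lawyers tested
SAMPLE_MEAN = 39.5
MU_0 = 35.0

Z_CRITICAL = 2.33   # standard normal table, upper tail, alpha = .01
T_CRITICAL = 2.602  # t table, df = 15, upper tail, alpha = .01


def standardized_statistic(standard_deviation):
    """(x-bar - mu_0) divided by the standard error built from the given standard deviation."""
    return (SAMPLE_MEAN - MU_0) / (standard_deviation / sqrt(N))


z_star = standardized_statistic(5.0)   # part (a): sigma = 5 assumed known
t_star = standardized_statistic(10.0)  # part (b): sigma unknown, s = 10 used

print(f"(a) z* = {z_star:.2f}, reject H0: {z_star > Z_CRITICAL}")
print(f"(b) t* = {t_star:.2f}, reject H0: {t_star > T_CRITICAL}")
```

The same sample mean leads to opposite conclusions: the larger standard deviation in part (b) doubles the standard error, and the t critical value is larger than the z critical value.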


9.14 The weights in pounds of weekly garbage for 25 households are shown in Table 9.9. Use these 
data to test the hypothesis that the weekly mean for all households in the city is 10 pounds. The 
alternative hypothesis is that the mean differs from 10 pounds. The level of significance is α = 
.05. 


Table 9.9 


Ans. The Minitab solution is shown below. 


MTB > set c1 
DATA > 5.5 7.5 12.5 10.0 15.5 13.2 4.0 2.5 14.0 13.3 
DATA > 15.5 3.3 7.5 10.0 6.6 16.5 9.0 8.5 16.6 9.7 
DATA > 18.0 17.6 4.9 2.9 13.4 
DATA > end 
MTB > ttest mean = 10 data in c1; 
SUBC > alt = 0. 


T-Test of the Mean 
Test of mu = 10.000 vs mu not = 10.000 


Variable   N    Mean     StDev   SE Mean   T      P 
C1         25   10.320   4.923   0.985     0.33   0.75 


Since the p value > .05, the null hypothesis is not rejected. 


HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION: 
LARGE SAMPLES 


9.15 A survey of 2500 women between the ages of 15 and 50 found that 28% of those surveyed 
relied on the pill for birth control. Use these sample results to test the null hypothesis that p = 
25% vs. the research hypothesis that p # 25%, where p represents the population percentage 
using the pill for birth control. Conduct the test at α = .05 by giving the critical values, the 
computed value of the test statistic, and your conclusion. 
Ans. The critical values are ±1.96. The test statistic is given by z = (p̄ − p0)/σ_p̄, where σ_p̄ = √(p0 q0/n). 
The standard error of the proportion is σ_p̄ = √(25 × 75/2500) = .87%, and the computed test statistic is 
z* = (28 − 25)/.87 = 3.45. The null hypothesis is rejected since the computed test statistic exceeds the 
critical value, 1.96. 


9.16 A poll taken just prior to election day finds that 389 of 700 registered voters intend to vote for 
Jake Konvalina for mayor of a large Midwestern city. Test the null hypothesis that p = .5 vs. 
the alternative that p > .5 at α = .05, where p represents the proportion of all voters who intend 
to vote for Jake. Use the p value approach to do the test. 


Ans. The sample proportion is p̄ = 389/700 = .556, and the standard error of the proportion is 
σ_p̄ = √(p0 q0/n) = √(.5 × .5/700) = .019. The computed test statistic is z* = (p̄ − p0)/σ_p̄ = 
(.556 − .5)/.019 = 2.95. 


The p value is given by P(z > 2.95) = .5 − .4984 = .0016. The null hypothesis is rejected since the 
p value < .05. 


Supplementary Problems 


NULL HYPOTHESIS AND ALTERNATIVE HYPOTHESIS 


9.17 Classify each of the following as a lower-tailed, upper-tailed, or two-tailed test: 
(a) H0: μ = 35, Ha: μ < 35 (b) H0: μ = 1.2, Ha: μ ≠ 1.2 (c) H0: μ = $1050, Ha: μ > $1050 


Ans. (a) lower-tailed test (b) two-tailed test (c) upper-tailed test 


9.18 Suppose the current mean cost to incarcerate a prisoner for one year is $18,000. Consider the following 
scenarios. 
(a) A prison reform plan is implemented which prison authorities feel will reduce prison costs. 
(b) It is uncertain how a new prison reform plan will affect costs. 
(c) A prison reform plan is implemented which prison authorities feel will increase prison costs. 


Let μ represent the mean cost per prisoner per year after the reform plan is implemented. Give the 
research hypothesis in symbolic form for each of the above cases. 


Ans. (a) Ha: μ < $18,000 (b) Ha: μ ≠ $18,000 (c) Ha: μ > $18,000 


TEST STATISTIC, CRITICAL VALUES, REJECTION AND NONREJECTION REGIONS 


9.19 A coin is tossed 10 times to determine if it is balanced. The coin will be declared “fair,” i.e., balanced, if 
between 2 and 8 heads, inclusive, are obtained. Otherwise, the coin will be declared “unfair.” The null 
hypothesis corresponding to this test is H0: (the coin is fair), and the alternative hypothesis is Ha: (the 
coin is unfair). Identify the test statistic, the critical values, and the rejection and nonrejection regions. 


Ans. The test statistic is the number of heads to occur in the 10 tosses of the coin. The critical values are 
1 and 9. The rejection region is x = 0, 1, 9, or 10, where x represents the number of heads to occur 
in the 10 tosses. The nonrejection region is x = 2, 3, 4, 5, 6, 7, or 8. 
9.20 A die is tossed 6 times to determine if it is balanced. The die will be declared “fair,” i.e., balanced, if the 
face “6” turns up 3 or fewer times. Otherwise, the die will be declared “unfair.” The null hypothesis 
corresponding to this test is H0: (the die is fair), and the alternative hypothesis is Ha: (the die is unfair). 
Identify the test statistic, the critical values, and the rejection and nonrejection regions. 

Ans. The test statistic is the number of times the face “6” turns up in 6 tosses. The critical value is 4. 
The rejection region is x = 4, 5, or 6, where x represents the number of times the face “6” turns up 
in the 6 tosses of the die. The nonrejection region is x = 0, 1, 2, or 3. 

TYPE I AND TYPE II ERRORS 

9.21 In problems 9.19 and 9.20, describe how a Type I and a Type II error would occur. 

Ans. In problem 9.19, if the coin is fair, but you happen to be unlucky and get 9 or 10 heads or 9 or 10 
tails and therefore conclude that the coin is unfair, you commit a Type I error. If the coin is 
actually bent so that it favors heads more often than tails or vice versa, but you obtain between 2 
and 8 heads, then you commit a Type II error. 
In problem 9.20, if the die is fair, but you happen to be unlucky and obtain an unusual happening 
such as 4 or more sixes in the 6 tosses and therefore conclude that the die is unfair, you commit a 
Type I error. If the die is loaded so that the face “6” comes up more often than the other faces, but 
when you toss it, you actually obtain 3 or fewer sixes in the 6 tosses, then you commit a Type II 
error. 

9.22 Find the level of significance in problems 9.19 and 9.20. 


Ans. The level of significance is given by α = P(rejecting the null hypothesis when the null hypothesis
is true). In problem 9.19, if the null hypothesis is true, then x, the number of heads in 10 tosses of a
fair coin, has a binomial distribution and the level of significance is α = P(x = 0, 1, 9, or 10 when
p = .5). Using the binomial tables, we find that the level of significance is α = .0010 + .0098 +
.0010 + .0098 = .0216. In problem 9.20, if the null hypothesis is true, then x, the number of times
the face “6” turns up in 6 tosses, has a binomial distribution with n = 6 and p = 1/6 = .167, and the
level of significance is α = P(x = 4, 5, or 6 when p = .167). Using the binomial probability
formula, we find α = .008096 + .000649 + .000022 = .008767.
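The two levels of significance above can be checked numerically. The following is a minimal Python sketch, not part of the original solution; the function names `binomial_pmf` and `significance_level` are illustrative assumptions, and the die calculation uses the rounded value p = .167 to match the answer.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def significance_level(rejection_values, n: int, p: float) -> float:
    """Sum the binomial probabilities of every outcome in the rejection region."""
    return sum(binomial_pmf(k, n, p) for k in rejection_values)

# Problem 9.19: reject if 0, 1, 9, or 10 heads in 10 tosses of a fair coin.
alpha_coin = significance_level([0, 1, 9, 10], n=10, p=0.5)

# Problem 9.20: reject if 4, 5, or 6 sixes in 6 tosses (p rounded to .167 as in the answer).
alpha_die = significance_level([4, 5, 6], n=6, p=0.167)

print(round(alpha_coin, 4), round(alpha_die, 6))
```

The exact coin value is 22/1024, or about .0215; the .0216 in the answer comes from adding table entries that were already rounded to four places.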


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: LARGE SAMPLES 


9.23 Home Videos Inc. surveys 450 households and finds that the mean amount spent for renting or buying
videos is $13.50 per month and the standard deviation of the sample is $7.25. Is this evidence sufficient
to conclude that μ > $12.75 per month at level of significance α = .01?


Ans. The computed test statistic is z* = 2.19 and the critical value is 2.33. The null hypothesis is not
rejected.


9.24 A survey of 700 hourly wages in the U.S. is taken and it is found that x̄ = $17.65 and s = $7.55. Is this
evidence sufficient to conclude that μ > $17.20, the stated current mean hourly wage in the U.S.? Test at
α = .05.


Ans. The computed test statistic is z* = 1.58 and the critical value is 1.65. The null hypothesis is not
rejected.
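The test statistics in problems 9.23 and 9.24 follow from z* = (x̄ − μ₀)/(s/√n). A minimal Python sketch of that computation is given here as an illustration; the helper names are assumptions, and the critical value comes from the standard library's normal distribution rather than from the printed table, so it differs slightly from the rounded 2.33 and 1.65.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(xbar: float, mu0: float, s: float, n: int) -> float:
    """Large-sample test statistic z* = (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

def upper_tail_critical(alpha: float) -> float:
    """z value with area alpha to its right under the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha)

z_923 = one_sample_z(13.50, 12.75, 7.25, 450)   # problem 9.23
z_924 = one_sample_z(17.65, 17.20, 7.55, 700)   # problem 9.24

for z, alpha in [(z_923, 0.01), (z_924, 0.05)]:
    crit = upper_tail_critical(alpha)
    decision = "reject H0" if z > crit else "do not reject H0"
    print(f"z* = {z:.2f}, critical value = {crit:.2f}: {decision}")
```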
CALCULATING TYPE II ERRORS


9.25 In problem 9.23, find the probability of a Type II error if μ = $13.25. Use the sample standard deviation
given in problem 9.23 as your estimate of σ.


Ans. β = P(not rejecting the null hypothesis when μ = $13.25) = P(x̄ < 13.55 when μ = $13.25) =
P(z < .87) = .8078.


9.26 In problem 9.24, find the probability of a Type II error if μ = $18.00. Use the sample standard deviation
given in problem 9.24 as your estimate of σ.


Ans. β = P(not rejecting the null hypothesis when μ = $18.00) = P(x̄ < 17.67 when μ = $18.00) =
P(z < −1.16) = .123.
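The Type II error probabilities above come from two steps: find the cutoff for x̄ under the null hypothesis, then find the probability that x̄ falls below that cutoff when μ takes the alternative value. The sketch below is an illustration under those assumptions; the function name is hypothetical, and small differences from .8078 and .123 arise because the printed answers round z to two places.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error_upper(mu0: float, mu_true: float, s: float, n: int, z_crit: float) -> float:
    """beta for an upper-tailed z test: P(xbar < cutoff when mu = mu_true)."""
    se = s / sqrt(n)                      # standard error of the mean
    cutoff = mu0 + z_crit * se            # largest xbar that does not reject H0
    return NormalDist().cdf((cutoff - mu_true) / se)

beta_925 = type_ii_error_upper(12.75, 13.25, 7.25, 450, z_crit=2.33)
beta_926 = type_ii_error_upper(17.20, 18.00, 7.55, 700, z_crit=1.65)
print(round(beta_925, 3), round(beta_926, 3))
```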


P VALUES 


9.27 Find the p value for the sample results given in problem 9.23. Use this computed p value to test the 
hypothesis given in the problem. 


Ans. The p value = P(Z > 2.19) = .0143. Since the p value exceeds the preset α, the null hypothesis is
not rejected. 


9.28 Find the p value for the sample results given in problem 9.24. Use this computed p value to test the 
hypothesis given in the problem. 


Ans. The p value = P(Z > 1.58) = .0571. Since the p value exceeds the preset α, the null hypothesis is
not rejected. 


HYPOTHESIS TESTS ABOUT A POPULATION MEAN: SMALL SAMPLES 


9.29 The mean score on the KSW Computer Science Aptitude Test is equal to 13.5. This test consists of 25 
problems and the mean score given above was obtained from data supplied by many colleges and 
universities. Metropolitan College administers the test to 25 randomly selected students and obtains a 
mean equal to 12.0 and a standard deviation equal to 4.1. Can it be concluded that the mean for all 
Metropolitan students is less than the mean reported nationally? Assume the scores at Metropolitan are 
normally distributed and use a = .05. 


Ans. The null hypothesis is rejected since t* = −1.83 < −1.711. The mean score for Metropolitan
students is less than the national mean. 


9.30 The claim is made that nationally, the mean payout per $1 wagered is 90 cents for casinos. Twenty
casinos are randomly selected from across the country. The data and a Minitab analysis are
shown below. If the null hypothesis is H₀: μ = 90, the alternative hypothesis is Hₐ: μ < 90,
and the level of significance is α = .01, what is your conclusion?


Data Display 

payout 

85 82 85 83 91 78 90 
86 95 82 86 74 89 92 
96 81 90 86 92 90 


MTB > ttest mean = 90 data in c1;
SUBC > alt = -1.


T-Test of the Mean
Test of mu = 90.00 vs mu < 90.00

Variable    N     Mean    StDev    SE Mean        T         P
Payout     20    86.65     5.63       1.26    -2.66    0.0077


Ans. Since the p value is less than α = .01, reject the claim and conclude that the mean payout is less
than 90 cents per dollar.
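The Minitab summary line can be reproduced from the 20 payout values. The following Python sketch is offered only as a cross-check; it computes the mean, standard deviation, standard error, and t statistic with the standard library. The critical value −2.539 (19 degrees of freedom, lower tail area .01) is taken from a t table and is an assumption of this sketch, not part of the Minitab output.

```python
from math import sqrt
from statistics import mean, stdev

payout = [85, 82, 85, 83, 91, 78, 90,
          86, 95, 82, 86, 74, 89, 92,
          96, 81, 90, 86, 92, 90]

def one_sample_t(data, mu0: float):
    """Return (t statistic, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    se = stdev(data) / sqrt(n)            # standard error of the mean
    return (mean(data) - mu0) / se, n - 1

t_star, df = one_sample_t(payout, 90)
t_critical = -2.539                       # from a t table, df = 19, alpha = .01 (assumed)
print(f"mean = {mean(payout):.2f}, s = {stdev(payout):.2f}, t* = {t_star:.2f}, df = {df}")
print("reject H0" if t_star < t_critical else "do not reject H0")
```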


HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION: LARGE SAMPLES 


9.31 A survey of 300 gun owners is taken and 40% of those surveyed say they have a gun for protection of
self/family. Use these results to test H₀: p = 32% vs. Hₐ: p > 32%, where p is the percent of all gun
owners who say they have a gun for protection of self/family. Test at α = .025.


Ans. The null hypothesis is rejected, since z* = 2.97 > 1.96. It is concluded that the percent is greater
than 32%.


9.32 Perform the following hypothesis tests about p.

(a) H₀: p = .35 vs. Hₐ: p ≠ .35, n = 100, sample proportion = .38, α = .10
(b) H₀: p = .75 vs. Hₐ: p < .75, n = 700, sample proportion = .71, α = .05
(c) H₀: p = .55 vs. Hₐ: p > .55, n = 390, sample proportion = .57, α = .01


Ans. (a) z* = 0.63, p value = .5286, Do not reject the null hypothesis.
(b) z* = −2.44, p value = .0073, Reject the null hypothesis.
(c) z* = 0.79, p value = .2148, Do not reject the null hypothesis.
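The answers to problems 9.31 and 9.32 can be checked with a short script. This is a sketch under the usual large-sample assumption that the sample proportion is approximately normal with standard error √(p₀(1 − p₀)/n); the function names and the `tail` argument are illustrative. Because the printed p values use z rounded to two places, the computed values differ in the third decimal.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(p_hat: float, p0: float, n: int) -> float:
    """Test statistic z* = (p_hat - p0) / sqrt(p0 (1 - p0) / n)."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def p_value(z: float, tail: str) -> float:
    """p value for a 'lower', 'upper', or 'two' tailed z test."""
    cdf = NormalDist().cdf
    if tail == "lower":
        return cdf(z)
    if tail == "upper":
        return 1 - cdf(z)
    return 2 * (1 - cdf(abs(z)))

z_931 = proportion_z(0.40, 0.32, 300)
z_a, z_b, z_c = (proportion_z(0.38, 0.35, 100),
                 proportion_z(0.71, 0.75, 700),
                 proportion_z(0.57, 0.55, 390))
print(round(z_931, 2))
print(round(z_a, 2), round(p_value(z_a, "two"), 4))
print(round(z_b, 2), round(p_value(z_b, "lower"), 4))
print(round(z_c, 2), round(p_value(z_c, "upper"), 4))
```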


Chapter 10 


Inferences for Two Populations 


SAMPLING DISTRIBUTION OF x̄₁ − x̄₂ FOR LARGE INDEPENDENT SAMPLES


Two samples drawn from two populations are independent samples if the selection of the sample
from population 1 does not affect the selection of the sample from population 2. The following
notation will be used for the sample and population measurements:


μ₁ and μ₂ = means of populations 1 and 2
σ₁ and σ₂ = standard deviations of populations 1 and 2
n₁ and n₂ = sizes of the samples drawn from populations 1 and 2 (n₁ ≥ 30, n₂ ≥ 30)
x̄₁ and x̄₂ = means of the samples selected from populations 1 and 2
s₁ and s₂ = standard deviations of the samples selected from populations 1 and 2


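When two large samples (n₁ ≥ 30, n₂ ≥ 30) are selected from two populations, the sampling
distribution of x̄₁ − x̄₂ is normal. The mean or expected value of the random variable x̄₁ − x̄₂ is
given by formula (10.1):

$$E(\bar{x}_1 - \bar{x}_2) = \mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 \tag{10.1}$$

The standard error of x̄₁ − x̄₂ is given by formula (10.2):

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \tag{10.2}$$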


Figure 10-1 illustrates that x̄₁ − x̄₂ is normally distributed and centers at μ₁ − μ₂. The standard error
of the curve is given by formula (10.2).

Fig. 10-1 (normal curve for x̄₁ − x̄₂, centered at μ₁ − μ₂)

Formula (10.3) is used to transform the distribution of x̄₁ − x̄₂ to a standard normal distribution.

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} \tag{10.3}$$


EXAMPLE 10.1 The mean height of adult males is 69 inches and the standard deviation is 2.5 inches. The
mean height of adult females is 65 inches and the standard deviation is 2.5 inches. Let population 1 be the
population of male heights, and population 2 the population of female heights. Suppose samples of 50 each are
selected from both populations. Then x̄₁ − x̄₂ is normally distributed with mean μ₁ − μ₂ = 69 − 65 = 4
inches and standard error equal to

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{6.25}{50} + \frac{6.25}{50}} = .5 \text{ inches}$$

The probability that the mean height of the sample of males exceeds the mean height of the sample of females
by more than 5 inches is represented by P(x̄₁ − x̄₂ > 5). The event x̄₁ − x̄₂ > 5 is converted to an equivalent
event involving z by using formula (10.3). The z value corresponding to x̄₁ − x̄₂ = 5 is

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} = \frac{5 - 4}{.5} = 2$$

From the standard normal distribution table the area to the right of 2 is .5 − .4772 = .0228. Only 2.28% of the
time would the mean height of a sample of 50 male heights exceed the mean height of a sample of 50 female
heights by 5 or more inches.
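The calculation in Example 10.1 can be sketched in Python as follows. This is an illustrative check only; the function name `diff_of_means_distribution` is an assumption, and the normal probabilities come from the standard library instead of the printed table.

```python
from math import sqrt
from statistics import NormalDist

def diff_of_means_distribution(mu1, sigma1, n1, mu2, sigma2, n2) -> NormalDist:
    """Sampling distribution of xbar1 - xbar2 per formulas (10.1) and (10.2)."""
    mean = mu1 - mu2
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return NormalDist(mean, se)

dist = diff_of_means_distribution(69, 2.5, 50, 65, 2.5, 50)
z = (5 - dist.mean) / dist.stdev          # formula (10.3) with xbar1 - xbar2 = 5
prob = 1 - dist.cdf(5)                    # P(xbar1 - xbar2 > 5)
print(dist.mean, dist.stdev, z, round(prob, 4))
```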


ESTIMATION OF μ₁ − μ₂ USING LARGE INDEPENDENT SAMPLES


The difference in sample means, x̄₁ − x̄₂, is a point estimate of μ₁ − μ₂. An interval estimate for
μ₁ − μ₂ is obtained by using formula (10.3). Since 95% of the area under the standard normal curve
is between −1.96 and 1.96, we have the following probability statement:

$$P\left(-1.96 < \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} < 1.96\right) = .95$$

Solving the inequality inside the parentheses for μ₁ − μ₂, we obtain the following 95% confidence
interval for μ₁ − μ₂:

$$(\bar{x}_1 - \bar{x}_2) \pm 1.96\,\sigma_{\bar{x}_1 - \bar{x}_2}$$

The general form of a confidence interval for μ₁ − μ₂ is given by formula (10.4), where z is
determined by the specified level of confidence. The z values for the most common levels of
confidence are given in Table 10.1.

$$(\bar{x}_1 - \bar{x}_2) \pm z \times \sigma_{\bar{x}_1 - \bar{x}_2} \tag{10.4}$$


Table 10.1 
Confidence level 


EXAMPLE 10.2 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed
on 55 patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.2.


Table 10.2

Sample             Sample size    Mean          Standard deviation
1. Keyhole         48             3.0 hours     1.5 hours
2. Conventional    55             15.0 hours    3.7 hours


The difference in sample means is x̄₁ − x̄₂ = 3.0 − 15.0 = −12.0. Since population standard deviations are
unknown, they are estimated with the sample standard deviations. The estimated standard error of the
difference in sample means is

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{2.25}{48} + \frac{13.69}{55}} = .544$$

A 90% confidence interval for μ₁ − μ₂ is found by using formula (10.4). The z value is determined to be 1.65
from Table 10.1. The interval is −12.0 ± 1.65 × .544, or −12.0 ± 0.9. The interval extends from −12.0 − 0.9 =
−12.9 to −12.0 + 0.9 = −11.1. That is, we are 90% confident that the keyhole procedure requires on the average
from 11.1 to 12.9 hours less time on breathing tubes. Note that when population standard deviations are
unknown and the samples are large, the sample standard deviations are substituted for the population
standard deviations.
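A short Python sketch of the interval in Example 10.2 follows, using formula (10.4) with sample standard deviations in place of the population values. The function name is illustrative, and the sketch assumes the keyhole sample standard deviation is 1.5 hours, the value implied by the 2.25 under the square root.

```python
from math import sqrt

def large_sample_ci(xbar1, s1, n1, xbar2, s2, n2, z):
    """Confidence interval (low, high) and standard error for mu1 - mu2, large samples."""
    diff = xbar1 - xbar2
    se = sqrt(s1**2 / n1 + s2**2 / n2)    # estimated standard error
    margin = z * se
    return diff - margin, diff + margin, se

# z = 1.65 is the tabled value for 90% confidence used in the example.
low, high, se = large_sample_ci(3.0, 1.5, 48, 15.0, 3.7, 55, z=1.65)
print(f"SE = {se:.3f}, 90% CI = ({low:.1f}, {high:.1f})")
```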


TESTING HYPOTHESES ABOUT μ₁ − μ₂ USING LARGE INDEPENDENT
SAMPLES


The procedure for testing hypotheses about the difference in population means when using two
large independent samples is given in Table 10.3.


Table 10.3


Steps for Testing a Hypothesis Concerning μ₁ − μ₂: Large Independent Samples


Step 1: State the null and research hypotheses. The null hypothesis is represented symbolically by H₀: μ₁ −
μ₂ = D₀, and the research hypothesis is of the form Hₐ: μ₁ − μ₂ ≠ D₀ or Hₐ: μ₁ − μ₂ < D₀ or Hₐ: μ₁ − μ₂ > D₀.


Step 2: Use the standard normal distribution table and the level of significance, α, to determine the
rejection region.


Step 3: Compute the value of the test statistic as follows:

$$z^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sigma_{\bar{x}_1 - \bar{x}_2}}$$

where x̄₁ − x̄₂ is the computed difference in the sample means, D₀ is the hypothesized difference in the
population means as given in the null hypothesis, and

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

or, if the population standard deviations are unknown,

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

is used as an estimate of σ_{x̄₁−x̄₂}.


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.


EXAMPLE 10.3 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed
on 55 patients. The length of hospital stay in days was recorded for each patient. A summary of the data is
given in Table 10.4.


Table 10.4

Sample             Sample size    Mean        Standard deviation
1. Keyhole         48             3.5 days    1.5 days
2. Conventional    55             8.0 days    2.0 days


These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than
the mean hospital stay for conventional patients with level of significance α = .01. The null hypothesis is H₀:
μ₁ − μ₂ = 0 and the research hypothesis is Hₐ: μ₁ − μ₂ < 0. This is a lower-tailed test, and the critical value is
−2.33, since P(z < −2.33) = .01. This critical value is determined in exactly the same way as it was in chapter 9.
The null hypothesis is rejected if the computed test statistic is less than the critical value. The following
quantities are needed to compute the value of the test statistic: D₀ = 0, x̄₁ − x̄₂ = 3.5 − 8.0 = −4.5, and the
estimated standard error is

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{2.25}{48} + \frac{4}{55}} = .346$$

The computed test statistic is:

$$z^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{S_{\bar{x}_1 - \bar{x}_2}} = \frac{-4.5 - 0}{.346} = -13.0$$

The null hypothesis is rejected and it is concluded that the mean hospital stay for the keyhole procedure is less
than that for the conventional procedure.
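The four steps of Table 10.3, applied to the data of Example 10.3, might be sketched in Python as below. The function name and the returned tuple are illustrative choices, and the critical value −2.33 is the tabled value quoted in the example.

```python
from math import sqrt

def two_sample_z(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Large-sample statistic z* = (xbar1 - xbar2 - D0) / sqrt(s1^2/n1 + s2^2/n2)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    return (xbar1 - xbar2 - d0) / se, se

z_star, se = two_sample_z(3.5, 1.5, 48, 8.0, 2.0, 55)
critical = -2.33                          # lower-tailed test at alpha = .01
print(f"SE = {se:.3f}, z* = {z_star:.1f}")
print("reject H0" if z_star < critical else "do not reject H0")
```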


SAMPLING DISTRIBUTION OF x̄₁ − x̄₂ FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) 
STANDARD DEVIATIONS 


When the sample sizes are small (one or both less than 30) and the population standard
deviations are unknown, the substitution of sample standard deviations for the population standard
deviations, as was done in Examples 10.2 and 10.3, is not appropriate. The use of the standard
normal table for confidence intervals and testing hypotheses is not valid in this case. Statisticians
have developed two different procedures for the case where the sample sizes are small and the
population standard deviations are unknown. Both procedures require the assumption that both populations
have normal distributions. In this section it will also be assumed that the populations have equal
standard deviations.

A statistical test for deciding whether to assume σ₁ = σ₂ or σ₁ ≠ σ₂ utilizes the two sample
standard deviations and a statistical distribution called the F distribution. However, a rule of thumb
used by some statisticians states that if .5 ≤ s₁/s₂ ≤ 2, then assume σ₁ = σ₂. Otherwise, assume that
σ₁ ≠ σ₂.
Suppose σ is the common population standard deviation, i.e., σ₁ = σ₂ = σ. The standard error of
x̄₁ − x̄₂ is given by formula (10.2) as

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Replacing the two population standard deviations by σ and factoring it out, we get formula (10.5):

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{10.5}$$

Now σ², the common population variance, is estimated by pooling the sample variances as a
weighted average. The pooled estimator of σ² is represented by S² and is given by

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{10.6}$$

Replacing σ² by S² in formula (10.5), the estimated standard error of the difference in the sample
means is obtained and is given in formula (10.7):

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{10.7}$$

Replacing σ_{x̄₁−x̄₂} by S_{x̄₁−x̄₂} in formula (10.3), we obtain

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{S_{\bar{x}_1 - \bar{x}_2}} \tag{10.8}$$

When sampling from two normally distributed populations having equal population standard
deviations, the statistic given in formula (10.8) has a t distribution with degrees of freedom given by

$$df = n_1 + n_2 - 2 \tag{10.9}$$


EXAMPLE 10.4 A sample of size 10 is randomly selected from a normally distributed population and a
second sample of size 15 is selected from another normally distributed population. The standard deviation of the
first sample is equal to 13.24 and the standard deviation of the second sample is equal to 24.25. The rule of
thumb, given above, suggests that the populations may be assumed to have a common standard deviation since
.5 ≤ s₁/s₂ = .55 ≤ 2. A pooled estimate of the common variance is given as follows:

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{9 \times 13.24^2 + 14 \times 24.25^2}{10 + 15 - 2} = 426.5458$$

and a pooled estimate of the common standard deviation is S = √426.5458 = 20.65. The estimated standard
error of the difference in sample means is as follows:

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{426.5458 \times \left(\frac{1}{10} + \frac{1}{15}\right)} = 8.43$$

The computations of S and S_{x̄₁−x̄₂} are required for setting confidence intervals on μ₁ − μ₂ and for testing
hypotheses concerning μ₁ − μ₂ when the two samples are small, and the population standard deviations are
unknown but assumed equal.
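The rule of thumb and formulas (10.6) and (10.7) translate directly into code. The sketch below reproduces the numbers of Example 10.4; the function names are illustrative assumptions rather than established routines.

```python
from math import sqrt

def equal_sd_rule_of_thumb(s1: float, s2: float) -> bool:
    """True when .5 <= s1/s2 <= 2, the rule of thumb for assuming equal sigmas."""
    return 0.5 <= s1 / s2 <= 2

def pooled_variance(s1: float, n1: int, s2: float, n2: int) -> float:
    """Formula (10.6): weighted average of the two sample variances."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

def pooled_standard_error(s1: float, n1: int, s2: float, n2: int) -> float:
    """Formula (10.7): estimated standard error of xbar1 - xbar2."""
    return sqrt(pooled_variance(s1, n1, s2, n2) * (1 / n1 + 1 / n2))

s_squared = pooled_variance(13.24, 10, 24.25, 15)
se = pooled_standard_error(13.24, 10, 24.25, 15)
print(equal_sd_rule_of_thumb(13.24, 24.25), round(s_squared, 4),
      round(sqrt(s_squared), 2), round(se, 2))
```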


ESTIMATION OF μ₁ − μ₂ USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) 
STANDARD DEVIATIONS 


The difference in sample means, x̄₁ − x̄₂, is a point estimate of μ₁ − μ₂. An interval estimate for
μ₁ − μ₂ is obtained by using formula (10.8). When independent small samples are selected from two
normal populations having unknown but equal standard deviations, the general form of a confidence
interval for μ₁ − μ₂ is given by formula (10.10), where t is obtained from the t distribution table and
is determined by the level of confidence and the degrees of freedom, which is given by df = n₁ + n₂ −
2. The standard error of the difference in sample means, S_{x̄₁−x̄₂}, is given by formula (10.7).

$$(\bar{x}_1 - \bar{x}_2) \pm t \times S_{\bar{x}_1 - \bar{x}_2} \tag{10.10}$$
EXAMPLE 10.5 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10 patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.5.
The difference in sample means is x̄₁ − x̄₂ = 3.0 − 7.0 = −4.0 hours. The pooled estimate of the common
population variance is obtained as follows:

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{7 \times 2.25 + 9 \times 7.29}{8 + 10 - 2} = 5.085$$

The standard error of the difference in sample means is:

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{5.085 \times \left(\frac{1}{8} + \frac{1}{10}\right)} = 1.070$$


Table 10.5

Sample             Sample size    Mean         Standard deviation
1. Keyhole         8              3.0 hours    1.5 hours
2. Conventional    10             7.0 hours    2.7 hours


To find a 95% confidence interval for μ₁ − μ₂, we need to find the t value for use in formula (10.10). The
degrees of freedom is df = n₁ + n₂ − 2 = 8 + 10 − 2 = 16. From the t distribution table having 16 degrees of
freedom, we find the following: P(−2.120 < t < 2.120) = .95. The proper value of t is therefore 2.120. The
margin of error is t × S_{x̄₁−x̄₂} = 2.120 × 1.070 = 2.2684 or 2.27 to 2 decimal places. The 95% confidence
interval is −4.0 ± 2.27. Assuming that the times spent on breathing tubes are normally distributed for both
procedures, the difference μ₁ − μ₂ is between −6.27 and −1.73 hours with 95% confidence.
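The interval in Example 10.5 can be reproduced with the sketch below, which applies formulas (10.6), (10.7), and (10.10). The function name is illustrative, and the t value 2.120 is passed in from the t table because the Python standard library has no t distribution.

```python
from math import sqrt

def pooled_ci(xbar1, s1, n1, xbar2, s2, n2, t):
    """Pooled-variance confidence interval (low, high) for mu1 - mu2."""
    s_squared = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(s_squared * (1 / n1 + 1 / n2))
    diff = xbar1 - xbar2
    return diff - t * se, diff + t * se

# t = 2.120 is the tabled value for 16 degrees of freedom and 95% confidence.
low, high = pooled_ci(3.0, 1.5, 8, 7.0, 2.7, 10, t=2.120)
print(f"95% CI = ({low:.2f}, {high:.2f})")
```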


TESTING HYPOTHESES ABOUT μ₁ − μ₂ USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) 
STANDARD DEVIATIONS 


The procedure for testing a hypothesis about the difference in population means when using two 
small independent samples from normal populations with equal (but unknown) standard deviations is 
given in Table 10.6. 


Table 10.6 


Steps for Testing a Hypothesis Concerning μ₁ − μ₂: Small Independent Samples from
Normal Populations with Equal (but Unknown) Standard Deviations


Step 1: State the null and research hypotheses. The null hypothesis is represented symbolically by H₀: μ₁ − μ₂
= D₀ and the research hypothesis is of the form Hₐ: μ₁ − μ₂ ≠ D₀ or Hₐ: μ₁ − μ₂ < D₀ or Hₐ: μ₁ − μ₂ > D₀.


Step 2: Use the t distribution table with degrees of freedom equal to n₁ + n₂ − 2 and the level of significance,
α, to determine the rejection region.


Step 3: Compute the value of the test statistic as follows:

$$t^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{S_{\bar{x}_1 - \bar{x}_2}}$$

where x̄₁ − x̄₂ is the computed difference in the sample means, D₀ is the hypothesized difference in the
population means as stated in the null hypothesis, and

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \quad \text{where} \quad S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in
the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 10.6 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10 patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given
in Table 10.6.


Table 10.6

Sample             Sample size    Mean        Standard deviation
1. Keyhole         8              3.5 days    1.5 days
2. Conventional    10             8.0 days    2.0 days


These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than
the mean hospital stay for conventional patients with level of significance α = .01. The null hypothesis is H₀:
μ₁ − μ₂ = 0 and the research hypothesis is Hₐ: μ₁ − μ₂ < 0. To determine the rejection region, note that the test is
a lower-tailed test and the degrees of freedom is df = n₁ + n₂ − 2 = 8 + 10 − 2 = 16. Using the t distribution table,
we find that for df = 16, P(t > 2.583) = .01, and therefore, P(t < −2.583) = .01. The shaded rejection region is
shown in Fig. 10-2.


Fig. 10-2 (t curve with 16 degrees of freedom; rejection region of area .01 to the left of −2.583)


The null hypothesis is to be rejected if the computed test statistic is less than −2.583. The computed test statistic
is determined according to step 3 in Table 10.6. The difference in sample means is x̄₁ − x̄₂ = 3.5 − 8.0 = −4.5.
The value specified for D₀ is 0. The pooled estimate of the common population variance is

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{7 \times 2.25 + 9 \times 4}{8 + 10 - 2} = 3.2344$$

and the standard error of the difference in sample means is

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{3.2344 \times \left(\frac{1}{8} + \frac{1}{10}\right)} = .853$$

The computed test statistic is

$$t^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{S_{\bar{x}_1 - \bar{x}_2}} = \frac{-4.5 - 0}{.853} = -5.28$$

Since t* is less than the critical value, the null hypothesis is rejected and it is concluded that mean length of
hospital stay is smaller for the keyhole procedure than the conventional procedure. This hypothesis test
procedure assumes that the two populations are normally distributed.
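The pooled t test of Example 10.6 can be sketched in Python from the summary statistics alone. The function below is an illustrative assumption, and the critical value −2.583 is the tabled value quoted in the example for 16 degrees of freedom.

```python
from math import sqrt

def pooled_t_statistic(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Return (t*, df) for the equal-variance two-sample t test."""
    s_squared = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(s_squared * (1 / n1 + 1 / n2))
    return (xbar1 - xbar2 - d0) / se, n1 + n2 - 2

t_star, df = pooled_t_statistic(3.5, 1.5, 8, 8.0, 2.0, 10)
critical = -2.583                         # lower-tailed, alpha = .01, df = 16
print(f"t* = {t_star:.2f}, df = {df}")
print("reject H0" if t_star < critical else "do not reject H0")
```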


EXAMPLE 10.7 A sociological study compared the dating practices of high school senior males and high
school senior females. The two samples were selected independently of one another. The number of dates per
month was recorded for each participant. The data are shown in Table 10.7.

Table 10.7

Males    Females
  8         7
 11         7
  7        12
  5        11
 13        14
 10        10
 10        10
  8         9
  9         4
  5         8


The assumptions of equal variances and normal populations are usually checked with statistical software
before the actual test of equality of means is carried out. This has been made much easier by the
widespread availability of statistical software. The checking of assumptions as well as the actual test of equality of
population means for the data in Table 10.7 will be illustrated by using Minitab. The commands %normplot
mdates and %normplot fdates produced the following plots, which may be used to check for normality.


Normal Probability Plot (mdates)
Average: 8.6     StDev: 2.54733     N: 10
Anderson-Darling Normality Test     A-Squared: 0.199     P Value: 0.840


Normal Probability Plot (fdates)
Average: 9.2     StDev: 2.85968     N: 10
Anderson-Darling Normality Test     A-Squared: 0.147     P Value: 0.948


If the sample data are selected from a normal population, the points on the normal probability plot will tend to 
fall along a straight line. Normality of the population is usually rejected if the p value is less than α = .05. For
the above data, the p values are 0.840 and 0.948 and normality of the populations is not rejected. 


An edited version of the Minitab procedure to test for equal population variances is shown below. A 1 in
column 1 indicates the response in c2 came from the male group and a 2 indicates the response came from the
female group. The assumption of equal variances is usually rejected if the p value shown in Bartlett's Test
(normal distribution) is less than .05. Since the p value is 0.736, the assumption of equal variances is not
rejected.

Row    C1    C2
  1     1     8
  2     1    11
  3     1     7
  4     1     5
  5     1    13
  6     1    10
  7     1    10
  8     1     8
  9     1     9
 10     1     5
 11     2     7
 12     2     7
 13     2    12
 14     2    11
 15     2    14
 16     2    10
 17     2    10
 18     2     9
 19     2     4
 20     2     8
MTB > %vartest c2 c1
Homogeneity of Variance
Response    C2
Factors     C1


Bartlett's Test (normal distribution)
Test Statistic: 0.114
P value       : 0.736


Having checked out the normality assumptions and the equal population variances assumption, the test 
procedure is now conducted. 


MTB > twot data in c2, groups in c1;
SUBC > alternative is 0;
SUBC > pooled procedure.


Two Sample T-Test and Confidence Interval
Two sample T for dates
Sample     N     Mean    St Dev    SE Mean
1         10     8.60      2.55       0.81
2         10     9.20      2.86       0.90


95% CI for mu (1) - mu (2): (-3.14, 1.94)
T-Test mu (1) = mu (2) (vs not =): T = -0.50  P = 0.63  DF = 18


To use the TWOT command in Minitab, the responses must be in a column and the group from which the
responses came must be in another column. The data are entered in columns 1 and 2 as shown above. The null
hypothesis is H₀: μ₁ − μ₂ = 0. The alternative subcommand indicates the nature of the research hypothesis. A −1
indicates a lower-tailed test, a 0 indicates a two-tailed test, and a 1 indicates an upper-tailed test. The
subcommand pooled procedure indicates that equal population standard deviations are assumed and a pooled
estimate is to be computed.

The output gives the sample size, mean, standard deviation, and standard error of the mean for both groups
separately. A 95% confidence interval for μ₁ − μ₂ is given. The computed test statistic is t* = −0.50. The two-
tailed p value is 0.63. The degrees of freedom is 18. The pooled standard deviation estimate is S = 2.71.
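The Minitab results for the dating data can be cross-checked with a few lines of Python. This sketch is not the Minitab procedure itself. It rests on one loud assumption: two observations in the printed listing are illegible (row 3 for males and row 20 for females), and the values 7 and 8 used here are the ones consistent with the reported means of 8.6 and 9.2 and the reported standard deviations. The t value 2.101 for 18 degrees of freedom is taken from a t table.

```python
from math import sqrt
from statistics import mean, variance

# Row 3 (7) and row 20 (8) are inferred from the reported means; see the note above.
males = [8, 11, 7, 5, 13, 10, 10, 8, 9, 5]
females = [7, 7, 12, 11, 14, 10, 10, 9, 4, 8]

def pooled_t(sample1, sample2, d0=0.0):
    """Return (t*, df, pooled standard deviation, standard error) from raw data."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2) - d0) / se
    return t, n1 + n2 - 2, sqrt(pooled_var), se

t_star, df, s_pooled, se = pooled_t(males, females)
diff = mean(males) - mean(females)
t_table = 2.101                           # 95% confidence, df = 18 (from a t table)
ci = (diff - t_table * se, diff + t_table * se)
print(f"T = {t_star:.2f}, DF = {df}, pooled S = {s_pooled:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```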


SAMPLING DISTRIBUTION OF x̄₁ − x̄₂ FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL 
(AND UNKNOWN) STANDARD DEVIATIONS 


When the population standard deviations are unknown and it is not reasonable to assume they are
equal, the sample variances are not pooled as they were in the previous three sections. Statisticians
have developed a test statistic for this case whose distribution is approximately a t
distribution. The standard error of the difference in the sample means is approximated by substituting
the sample variances for the population variances in formula (10.2), as was done in the large sample
case, to obtain the estimated standard error given in the following:

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{10.11}$$

In formula (10.3), we replace σ_{x̄₁−x̄₂} by the expression for S_{x̄₁−x̄₂} in formula (10.11), and obtain the
statistic shown in formula (10.12):

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{10.12}$$

When sampling from two normally distributed populations having unequal population standard
deviations, the statistic given in formula (10.12) has an approximate t distribution and the degrees of
freedom is given by

$$df = \text{minimum of } \{(n_1 - 1), (n_2 - 1)\} \tag{10.13}$$


EXAMPLE 10.8 A sample of size 13 is taken from a normally distributed population, and a sample of size 15
is taken from a second normally distributed population. The mean of population 1 is 70 and the mean of
population 2 is 65, so that μ₁ − μ₂ = 5. The population variances are unknown, but assumed to be unequal.
The following statistic has an approximate t distribution.

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 5}{S_{\bar{x}_1 - \bar{x}_2}}$$

The degrees of freedom is df = minimum {12, 14} = 12.


ESTIMATION OF μ₁ − μ₂ USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) 
STANDARD DEVIATIONS 


The difference in sample means, x̄₁ − x̄₂, is a point estimate of μ₁ − μ₂. An interval estimate for
μ₁ − μ₂ is obtained by using formula (10.12). When independent small samples are selected from two
normal populations having unknown and unequal standard deviations, the general form of a
confidence interval for μ₁ − μ₂ is given by formula (10.14), where t is obtained from the t distribution
table and is determined by the level of confidence and the degrees of freedom, which is given by df =
minimum of {(n₁ − 1), (n₂ − 1)}. The standard error of the difference in sample means, S_{x̄₁−x̄₂}, is
given by formula (10.11).

$$(\bar{x}_1 - \bar{x}_2) \pm t \times S_{\bar{x}_1 - \bar{x}_2} \tag{10.14}$$

EXAMPLE 10.9 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10 patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.8.
The difference in sample means is x̄₁ − x̄₂ = 3.5 − 5.0 = −1.5 hours. Because the ratio of the sample standard
deviation for sample 2 divided by that for sample 1 is 3.4, it is not reasonable to assume that σ₁ = σ₂.


Table 10.8

Sample             Sample size    Mean         Standard deviation
1. Keyhole         8              3.5 hours    0.5 hour
2. Conventional    10             5.0 hours    1.7 hours


To find a 95% confidence interval for μ₁ − μ₂, we need to find the t value for use in formula (10.14). The
degrees of freedom is df = minimum of {(n₁ − 1), (n₂ − 1)} = minimum {7, 9} = 7. From the t distribution table
having 7 degrees of freedom, we find the following: P(−2.365 < t < 2.365) = .95. The proper value of t is
therefore 2.365. Using the sample standard deviations in Table 10.8, the standard error of the difference in
sample means is

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{.25}{8} + \frac{2.89}{10}} = .566$$

The margin of error is t × S_{x̄₁−x̄₂} = 2.365 × .566 = 1.3386 or 1.34 to 2 decimal places. The 95% confidence
interval is −1.5 ± 1.34. Assuming that the times spent on breathing tubes are normally distributed for both
procedures, the difference μ₁ − μ₂ is between −2.84 and −0.16 hours with 95% confidence.
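Example 10.9 can be reproduced with the sketch below, which uses formula (10.11) for the standard error and formula (10.13) for the degrees of freedom. The function name is illustrative, and the t value 2.365 is supplied from the t table for 7 degrees of freedom.

```python
from math import sqrt

def unpooled_ci(xbar1, s1, n1, xbar2, s2, n2, t):
    """Unequal-variance interval for mu1 - mu2; returns (low, high, se, df)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)    # formula (10.11)
    df = min(n1 - 1, n2 - 1)              # formula (10.13)
    diff = xbar1 - xbar2
    return diff - t * se, diff + t * se, se, df

low, high, se, df = unpooled_ci(3.5, 0.5, 8, 5.0, 1.7, 10, t=2.365)
print(f"df = {df}, SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```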


EXAMPLE 10.10 Mothers Against Drunk Driving (MADD) has pushed for lowering the limit for driving
while intoxicated from .1% to .08%. A study of alcohol-related traffic deaths compared the blood-alcohol levels
of individuals involved in such accidents in states with a .08% limit with those in states with a .1% limit. The
results are shown in Table 10.9. Minitab will be used to set a 95% confidence interval on μ₁ − μ₂, where μ₁ is
the mean blood-alcohol level of such individuals in states having a .08% limit and μ₂ is the mean blood-alcohol
level in states having a .1% limit.


Table 10.9 


0.08% level .1% level 


Normal probability plots, such as those in Example 10.7, indicate that it is reasonable to assume that the
samples are taken from normally distributed populations. An edited version of the Minitab test for equal
population variances is shown below. Before executing the Minitab command %vartest c2 c1, the data in Table
10.9 are set in columns c1 and c2, where c1 contains a 1 or a 2, depending on where the sample value in c2
comes from. The setup for the columns is similar to that shown in Example 10.7. Bartlett's Test is used to test
the null hypothesis H₀: The population standard deviations are equal vs. the alternative hypothesis Hₐ: The
population standard deviations are not equal. The hypothesis of equal standard deviations is usually rejected if
the p value for Bartlett's Test is less than .05. In this case, the p value is equal to 0.025, and the hypothesis of
equal population standard deviations is rejected.


MTB > %vartest c2 c1


Homogeneity of Variance
Response    C2
Factors     C1
ConfLvl     95.0000

Bartlett's Test (normal distribution)
Test Statistic: 4.998
P value       : 0.025


Since it is reasonable to assume that the samples come from normally distributed populations with unequal
population standard deviations, the techniques of this section are appropriate. The command twot data in c2
groups in c1 produces a 95% confidence interval for μ₁ − μ₂. If the population standard deviations were
assumed to be equal, the subcommand SUBC> pooled procedure would be added to the twot command. The
Minitab output gives the sample size, mean, standard deviation, and standard error for both groups. The 95%
confidence interval for μ₁ − μ₂ extends from −0.0829 to −0.014. The edited output is as follows:


MTB > twot data in c2 groups in c1
Two Sample T-Test and Confidence Interval


Two sample T for C2
C1     N      Mean    St Dev    SE Mean
1     15    0.0820    0.0300     0.0078
2     15    0.1307    0.0562      0.015

95% CI for mu (1) - mu (2): (-0.0829, -0.014)


TESTING HYPOTHESES ABOUT μ₁ − μ₂ USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL 
(AND UNKNOWN) STANDARD DEVIATIONS 


The procedure for testing a hypothesis about the difference in population means when using two 
small independent samples from normal populations with unequal (and unknown) standard 
deviations is given in Table 10.10. 

Table 10.10 


Steps for Testing a Hypothesis Concerning μ₁ − μ₂: Small Independent Samples from
Normal Populations with Unequal (and Unknown) Standard Deviations


Step 1: State the null and research hypotheses. The null hypothesis is represented symbolically by H₀: μ₁ − μ₂
= D₀ and the research hypothesis is of the form Hₐ: μ₁ − μ₂ ≠ D₀ or Hₐ: μ₁ − μ₂ < D₀ or Hₐ: μ₁ − μ₂ > D₀.


Step 2: Use the t distribution table with degrees of freedom equal to df = minimum of {(n₁ − 1), (n₂ − 1)} and
the level of significance, α, to determine the rejection region.


Step 3: Compute the value of the test statistic as follows:

$$t^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{S_{\bar{x}_1 - \bar{x}_2}}$$

where x̄₁ − x̄₂ is the computed difference in the sample means, D₀ is the hypothesized difference in the
population means as stated in the null hypothesis, and

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in
the rejection region. Otherwise, the null hypothesis is not rejected.


EXAMPLE 10.11 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10 patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given
in Table 10.11. The null hypothesis $H_0: \mu_1 - \mu_2 = -4$ vs. $H_a: \mu_1 - \mu_2 \ne -4$ is of interest to the
researchers involved in the study. In words, the null hypothesis states that the mean hospital stay is 4 days less
for the keyhole procedure than for the conventional procedure. The research hypothesis states that the difference
in means, $\mu_1 - \mu_2$, is not equal to -4 days. Since the sample standard deviation for the conventional
procedure is 2.5 times the sample standard deviation for the keyhole procedure, it is assumed that the population
standard deviations are unequal. The sample observations (which are not shown) indicate that it is reasonable to
assume that both populations are normally distributed.


Table 10.11

Sample              Sample size    Mean        Standard deviation
1. Keyhole               8         3.5 days    1.2 days
2. Conventional         10         8.0 days    3.0 days

The degrees of freedom for the t distribution is df = minimum of {$(n_1 - 1)$, $(n_2 - 1)$} = minimum of {7, 9} = 7.
For $\alpha = .01$, an area of .005 is allocated to each tail, since the research hypothesis is two-tailed. Using the t
distribution with 7 degrees of freedom, it is found that P(t > 3.499) = .005. The critical values are $\pm 3.499$. The
estimated standard error of the difference in sample means is:

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1.2^2}{8} + \frac{3.0^2}{10}} = 1.039$$

The computed test statistic is:

$$t^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{S_{\bar{x}_1 - \bar{x}_2}} = \frac{3.5 - 8.0 - (-4)}{1.039} = -.48$$

Since t* is between -3.499 and 3.499, the evidence is not sufficient to reject the null hypothesis.
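
The arithmetic in Example 10.11 can be checked with a short Python sketch. The function name unequal_variance_t is hypothetical and is not part of Minitab; the sketch only applies the formulas of Table 10.10 to the summary statistics of Table 10.11, and the critical value 3.499 is copied from the t table rather than computed.

```python
import math

def unequal_variance_t(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """Return (standard error, t*, conservative df) from summary statistics.

    Hypothetical helper: implements Step 2 and Step 3 of Table 10.10.
    """
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # estimated standard error
    t_star = (mean1 - mean2 - d0) / se        # test statistic
    df = min(n1 - 1, n2 - 1)                  # conservative degrees of freedom
    return se, t_star, df

# Example 10.11: keyhole (sample 1) vs. conventional (sample 2), D0 = -4
se, t_star, df = unequal_variance_t(3.5, 1.2, 8, 8.0, 3.0, 10, d0=-4.0)
critical = 3.499                              # table value for df = 7, alpha = .01, two-tailed
reject = abs(t_star) > critical
print(round(se, 3), round(t_star, 2), df, reject)
```

The same function can be reused for any problem in this section that supplies means, standard deviations, and sample sizes.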


EXAMPLE 10.12 The data in Table 10.9 are used to test the research hypothesis $H_a: \mu_1 - \mu_2 < 0$. The
statement $\mu_1 - \mu_2 < 0$ is equivalent to $\mu_1 < \mu_2$. The statistical statement $\mu_1 < \mu_2$ states that the
mean blood-alcohol level for individuals involved in alcohol-related traffic deaths is lower in states with a .08%
limit than the mean level in states with a .1% limit. In Example 10.10, it was shown that it was reasonable to
assume normal population distributions and unequal population standard deviations. After putting the data in
columns c1 and c2 as described in Example 10.10, the following Minitab output was obtained.


MTB > twot data in c2 groups in c1;
SUBC > alternative = -1.


Two Sample T-Test and Confidence Interval

Two sample T for C2
C1     N      Mean     StDev    SE Mean
1     15    0.0820    0.0300     0.0078
2     15    0.1307    0.0562     0.0150

95% CI for mu (1) - mu (2): (-0.0829, -0.014)
T-Test mu (1) = mu (2) (vs <): T = -2.96 P = 0.0038 DF = 21


The subcommand SUBC> alternative = -1. indicates that a lower-tailed test is required. Note that the research
hypothesis is supported at the $\alpha = .05$ level, since the p value is 0.0038. The computed test statistic is shown to
be T = -2.96 and the degrees of freedom is shown as 21. The test statistic is computed as shown in Table 10.10.
Minitab does not compute the degrees of freedom as df = minimum of {$(n_1 - 1)$, $(n_2 - 1)$} = minimum of {14, 14} = 14.
Instead, its degrees of freedom computation is given by a more complicated formula. The use of df = minimum of
{$(n_1 - 1)$, $(n_2 - 1)$} gives a more conservative test than that used by Minitab. Both ways of computing the
degrees of freedom may be found in various textbooks.
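
The text does not state the more complicated formula. The sketch below assumes it is the Welch-Satterthwaite approximation, which is the usual choice in statistical software for the unpooled two-sample t procedure; that assumption is not confirmed by the text, although truncating its result to an integer does reproduce the DF = 21 in the output above. The function name welch_df is hypothetical.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (assumed formula)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Summary statistics taken from the Minitab output of Example 10.12
mean1, s1, n1 = 0.0820, 0.0300, 15
mean2, s2, n2 = 0.1307, 0.0562, 15

df_welch = welch_df(s1, n1, s2, n2)
df_conservative = min(n1 - 1, n2 - 1)
t_stat = (mean1 - mean2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

print(math.floor(df_welch), df_conservative, round(t_stat, 2))
```

Because 14 is smaller than 21, the table-based rule leads to a larger critical value, which is why the text calls it the more conservative choice.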


SAMPLING DISTRIBUTION OF $\bar{d}$ FOR NORMALLY DISTRIBUTED
DIFFERENCES COMPUTED FOR DEPENDENT SAMPLES 


The previous sections in this chapter have dealt with independent samples selected from two 
populations. When sample values are purposely matched or paired, the samples are called dependent 
samples. The samples are also referred to as paired or matched samples. Dependent samples include 
pairs of measurements such as before/after measurements made on the same person or machine, pre- 
and post-test scores taken on the same person, similar measurements made on twins who have 
undergone different treatments, and so forth. Table 10.12 illustrates a typical set of paired data. The 
differences are formed by subtracting the second sample value from the first.

Table 10.12


The estimation and testing procedures for experiments involving paired samples involve the 
differences shown in Table 10.12. As in previous sections, we shall investigate the sampling 
distribution of the mean of the sample differences first, and then apply these results to establish 
confidence intervals and test hypothesis concerning the mean of the population of all possible 
differences. The following notation will be used for inferences involving dependent samples. 


$\mu_d$ = the mean of the population of paired differences

$\sigma_d$ = the standard deviation of the population of paired differences

$\bar{d}$ = the mean of the paired differences which are computed from the samples

$S_d$ = the standard deviation of the paired differences which are computed from the samples

n = the number of paired differences which are computed from the samples

$\sigma_{\bar{d}}$ = the standard error of $\bar{d}$

$S_{\bar{d}}$ = the estimated standard error of $\bar{d}$


The sample mean of paired differences is given by

$$\bar{d} = \frac{\sum d}{n} \qquad (10.15)$$

The sample standard deviation of paired differences is given by

$$S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} \qquad (10.16)$$

The estimated standard error of $\bar{d}$ is given by

$$S_{\bar{d}} = \frac{S_d}{\sqrt{n}} \qquad (10.17)$$

When the population of differences is normally distributed, the statistic given in formula (10.18) has
a t distribution with (n - 1) degrees of freedom.

$$t = \frac{\bar{d} - \mu_d}{S_{\bar{d}}} \qquad (10.18)$$


This result is perfectly analogous to the previous result given in Chapters 8 and 9; namely the result
that the statistic given in formula (10.19) has a t distribution with df = n - 1 when the sample values
comprising $\bar{x}$ come from a normally distributed population having mean equal to $\mu$.

$$t = \frac{\bar{x} - \mu}{S_{\bar{x}}} \qquad (10.19)$$

The confidence interval for $\mu_d$ as well as the procedure for testing hypothesis about $\mu_d$ are analogous
to the techniques given in Chapters 8 and 9 for estimating the mean of one population or testing a
hypothesis about the mean of a population. The only difference between the techniques is that in the
case of dependent samples, the procedures are applied to differences.


EXAMPLE 10.13 A digital sphygmomanometer gives diastolic blood pressure readings that are 5 units higher
on the average than those obtained by the traditional method used by most medical professionals. Suppose 20
individuals have their diastolic blood pressure taken both ways. Let x represent the readings obtained by using
the digital sphygmomanometer, and let y represent the readings obtained by the traditional method. The
differences in readings are found by the formula d = x - y. The mean population difference is $\mu_d = 5$ units.
Assuming that the population of all differences is normally distributed, the following statistic has a t distribution
with df = 20 - 1 = 19:

$$t = \frac{\bar{d} - 5}{S_{\bar{d}}}$$


ESTIMATION OF $\mu_d$ USING NORMALLY DISTRIBUTED DIFFERENCES
COMPUTED FROM DEPENDENT SAMPLES


When a paired sample of size n is selected and differences are computed as shown in Table
10.12, a confidence interval for the population mean difference, $\mu_d$, can be formed by using the
sampling distribution of the statistic given in formula (10.18). The confidence interval is given by
formula (10.20), where t is determined by the level of confidence and df = n - 1. The standard error
of $\bar{d}$ is computed by using formulas (10.16) and (10.17).

$$\bar{d} \pm t \times S_{\bar{d}} \qquad (10.20)$$

EXAMPLE 10.14 A psychological study compared the amount of manganese in tears that lubricate the eye
with the amount in emotional tears. A measurement, which is related to the amount of manganese, was taken for
each type of tear on 10 different subjects. The results are shown in Table 10.13. The computations needed to set
a 99% confidence interval on $\mu_d$ are given below the table.


Table 10.13

Subject    Lubricating tears    Emotional tears    Difference d
   1              10                  12               -2
   2              13                  12                1
   3              11                   9                2
   4              10                  11               -1
   5               9                   7                2
   6               8                   9               -1
   7              10                  11               -1
   8              10                  13               -3
   9               8                  10               -2
  10              12                  10                2

The sum of the differences is $\sum d$ = -2 + 1 + 2 - 1 + 2 - 1 - 1 - 3 - 2 + 2 = -3 and the sum of the squares of the
differences is $\sum d^2$ = 4 + 1 + 4 + 1 + 4 + 1 + 1 + 9 + 4 + 4 = 33. The mean difference is $\bar{d}$ = -.3. The sample
standard deviation for the differences is

$$S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} = \sqrt{\frac{33 - (-3)^2 / 10}{9}} = 1.889$$

The standard error of $\bar{d}$ is

$$S_{\bar{d}} = \frac{S_d}{\sqrt{n}} = \frac{1.889}{\sqrt{10}} = .597$$

The t value for a 99% confidence interval is found as follows: The degrees of freedom is df = 10 - 1 = 9. From
the t distribution table for 9 degrees of freedom, we find that P(-3.250 < t < 3.250) = .99. The 99% confidence
interval is found by using formula (10.20) to be -.3 $\pm$ 3.25 x .597, or the 99% confidence interval is (-2.24,
1.64). The confidence interval is valid provided the population of differences is normally distributed.


EXAMPLE 10.15 A Minitab solution to Example 10.14 is given below. The lubricating tear data is put in
column 1 and the emotional tear data is put in column 2. The command let c3 = c1 - c2 computes the
differences and puts them in column 3. The command name c3 'diff' assigns the name diff to column 3. The
command tinterval 99% confidence data in c3 computes the 99% confidence interval for $\mu_d$. Note that
Minitab computes and prints out the mean difference, the sample standard deviation of differences, the standard
error of $\bar{d}$, as well as the 99% confidence interval. These quantities are the same as those computed by hand in
Example 10.14.


Data Display

Row    lubricate    emotion
  1        10          12
  2        13          12
  3        11           9
  4        10          11
  5         9           7
  6         8           9
  7        10          11
  8        10          13
  9         8          10
 10        12          10


MTB > let c3 = c1 - c2
MTB > name c3 'diff'
MTB > tinterval 99% confidence data in c3


Confidence Intervals
Variable     N      Mean    StDev   SE Mean       99.0 % CI
diff        10    -0.300    1.889     0.597   (-2.241, 1.641)
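
As a cross-check on the hand computation and the Minitab output, the paired-difference interval can be sketched in Python using only the standard library. The variable names are illustrative, the data are those in the Data Display above, and the t value 3.250 is copied from the t table for df = 9 rather than computed.

```python
import math
import statistics

# Data from the Data Display of Example 10.15
lubricating = [10, 13, 11, 10, 9, 8, 10, 10, 8, 12]
emotional = [12, 12, 9, 11, 7, 9, 11, 13, 10, 10]

d = [x - y for x, y in zip(lubricating, emotional)]  # paired differences
n = len(d)
d_bar = statistics.mean(d)          # formula (10.15)
s_d = statistics.stdev(d)           # formula (10.16), n - 1 in the denominator
se = s_d / math.sqrt(n)             # formula (10.17)

t_value = 3.250                     # table value: df = 9, 99% confidence
lower = d_bar - t_value * se        # formula (10.20)
upper = d_bar + t_value * se
print(round(d_bar, 3), round(s_d, 3), round(se, 3), round(lower, 3), round(upper, 3))
```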


TESTING HYPOTHESIS ABOUT $\mu_d$ USING NORMALLY DISTRIBUTED
DIFFERENCES COMPUTED FROM DEPENDENT SAMPLES


The procedure for a hypothesis test about the mean difference for paired samples is given in
Table 10.14.

Table 10.14


Steps for Testing a Hypothesis Concerning $\mu_d$: Normally Distributed Differences
Computed from Dependent Samples


Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by
$H_0: \mu_d = D_0$ and the research hypothesis is of the form $H_a: \mu_d \ne D_0$ or $H_a: \mu_d < D_0$ or $H_a: \mu_d > D_0$.


Step 2: Use the t distribution table with degrees of freedom equal to df = n - 1 and the level of significance,
$\alpha$, to determine the rejection region.


Step 3: Compute the value of the test statistic as follows:

$$t^* = \frac{\bar{d} - D_0}{S_{\bar{d}}}$$

where $\bar{d} = \frac{\sum d}{n}$, $D_0$ is the hypothesized value of $\mu_d$ in the null hypothesis,
$S_{\bar{d}} = \frac{S_d}{\sqrt{n}}$, and $S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}}$.


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.


EXAMPLE 10.16 A sociological study concerning marriage and the family was conducted and one of several 
factors of interest was the educational level of husbands and wives. Table 10.15 gives the number of years of 
education beyond high school for 15 couples. These data were used to test the null hypothesis that the mean 
educational levels are the same vs. the research hypothesis that the educational levels are different at level of
significance $\alpha = .05$.


Table 10.15


The degrees of freedom is df = 15 - 1 = 14, and since the research hypothesis is two-tailed and $\alpha = .05$, a t
distribution curve with area equal to .025 in each tail will determine the rejection regions. Consulting the t
distribution table with df = 14, we find that P(t > 2.145) = .025. Figure 10-3 illustrates the rejection regions.


Fig. 10-3 Rejection regions of area .025 in each tail, to the left of -2.145 and to the right of 2.145


The sum of the differences is $\sum d$ = 8 and the sum of the squares of the differences is $\sum d^2$ = 120. The mean
difference is $\bar{d} = \frac{8}{15} = .533$, and the standard deviation is

$$S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} = \sqrt{\frac{120 - 8^2 / 15}{14}} = 2.8752$$

The standard error of the mean difference is

$$S_{\bar{d}} = \frac{S_d}{\sqrt{n}} = \frac{2.8752}{\sqrt{15}} = .742$$

The computed test statistic is

$$t^* = \frac{\bar{d} - D_0}{S_{\bar{d}}} = \frac{.533 - 0}{.742} = 0.72$$

Since t* = 0.72 does not fall within the rejection region shown in Fig. 10-3, we are unable to conclude that the
educational levels of husbands and wives are different.
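
The steps of Table 10.14 can be traced with a brief Python sketch that starts from the two sums quoted in Example 10.16. It assumes only those sums and the table value 2.145; the variable names are illustrative.

```python
import math

# Quantities stated in Example 10.16
n = 15          # number of couples
sum_d = 8       # sum of the differences
sum_d2 = 120    # sum of the squared differences
d0 = 0          # hypothesized mean difference under the null hypothesis

d_bar = sum_d / n
s_d = math.sqrt((sum_d2 - sum_d**2 / n) / (n - 1))   # formula (10.16)
se = s_d / math.sqrt(n)                               # formula (10.17)
t_star = (d_bar - d0) / se                            # Step 3 of Table 10.14

critical = 2.145                                      # table value: df = 14, area .025 per tail
reject = abs(t_star) > critical                       # Step 4, two-tailed test
print(round(d_bar, 3), round(s_d, 4), round(se, 3), round(t_star, 2), reject)
```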


EXAMPLE 10.17 A Minitab solution for Example 10.16 is shown below. The data for the husbands is put into
column c1 and the data for the wives is put into column c2. The differences are put into c3, and a one-sample t
test is performed on c3. The subcommand SUBC> alternative = 0. indicates a two-tailed alternative hypothesis.
Note that the mean, standard deviation, standard error, and computed t values are the same as those computed in
Example 10.16. Since the computed p value is 0.48, the null hypothesis is not rejected at the $\alpha = .05$ level.


Data Display
Row    husband    wife
  1       4         2
  2       0         4
  3       7         6
  4       4         4
  5       2         3
  6       8         4
  7       4         4
  8       3
  9       1         0
 10       3         2
 11       4         4
 12       8         2
 13       0         4
 14       4         0
 15       4

MTB > let c3 = c1 - c2
MTB > name c3 'diff'
MTB > ttest mean = 0, data in c3;
SUBC > alternative = 0.


T-Test of the Mean 
Test of mu = 0.000 vs mu not = 0.000 


Variable N Mean StDev SE Mean T P 
diff 15 0.533 2.875 0.742 0.72 0.48 


SAMPLING DISTRIBUTION OF $\bar{p}_1 - \bar{p}_2$ FOR LARGE INDEPENDENT SAMPLES


Previous sections in this chapter have discussed inferences concerning the differences in means 
for two populations. The remaining sections of Chapter 10 are concerned with inferences concerning 
the differences in population proportions or percents. How does the percent of defectives produced
by machine 1 compare with the percent of defectives produced by machine 2? Is there a difference in
the percent of males who will vote for a presidential candidate and the percent of females who will
vote for the candidate? Is the percentage of smokers the same for African-Americans as it is for
Whites? All of these questions involve the comparison of percents or proportions. Sample
proportions will be used to estimate the differences in population proportions or to test a hypothesis
about the differences in population proportions. The following notation will be used.


$p_1$ and $p_2$ = proportions in populations 1 and 2 having the characteristic of interest
$n_1$ and $n_2$ = sizes of the independent samples drawn from populations 1 and 2
$\bar{p}_1$ and $\bar{p}_2$ = proportions in samples 1 and 2 having the characteristic of interest
$q_1 = 1 - p_1$ and $q_2 = 1 - p_2$ = proportions in populations 1 and 2 not having the characteristic
$\bar{q}_1 = 1 - \bar{p}_1$ and $\bar{q}_2 = 1 - \bar{p}_2$ = proportions in samples 1 and 2 not having the characteristic


When the sample sizes, $n_1$ and $n_2$, are such that $n_1 p_1 > 5$, $n_2 p_2 > 5$, $n_1 q_1 > 5$, and $n_2 q_2 > 5$, the
sampling distribution of $\bar{p}_1 - \bar{p}_2$ is approximately normal. The mean or expected value of
$\bar{p}_1 - \bar{p}_2$ is given by formula (10.21):

$$E(\bar{p}_1 - \bar{p}_2) = \mu_{\bar{p}_1 - \bar{p}_2} = p_1 - p_2 \qquad (10.21)$$

The standard error of $\bar{p}_1 - \bar{p}_2$ is given by formula (10.22).

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \qquad (10.22)$$

When the sample sizes are large enough to satisfy the above requirements, the distribution of $\bar{p}_1 - \bar{p}_2$
is as shown in Fig. 10-4. The standard error of the curve is given by formula (10.22).

Fig. 10-4 Normal curve for the sampling distribution of $\bar{p}_1 - \bar{p}_2$, centered at $p_1 - p_2$


Formula (10.23) is used to transform the distribution of $\bar{p}_1 - \bar{p}_2$ to a standard normal distribution.

$$z = \frac{\bar{p}_1 - \bar{p}_2 - (p_1 - p_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}} \qquad (10.23)$$


EXAMPLE 10.18 Three percent of the items produced by machine 1 have a minor defect and 2% of the same
item produced by machine 2 have a minor defect. If samples of size 500 are selected from each machine, the
distribution of $\bar{p}_1 - \bar{p}_2$ will be normal since $n_1 p_1$ = 500 x .03 = 15, $n_2 p_2$ = 500 x .02 = 10,
$n_1 q_1$ = 500 x .97 = 485, and $n_2 q_2$ = 500 x .98 = 490. Normality may be assumed, since all four products are
greater than 5. The mean value of $\bar{p}_1 - \bar{p}_2$ is .03 - .02 = .01, and the standard error of
$\bar{p}_1 - \bar{p}_2$ is

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{.03 \times .97}{500} + \frac{.02 \times .98}{500}} = .00987$$

The probability that the percent defective in the sample from machine 1 exceeds the percent defective in the
sample from machine 2 by 3% or more is expressed as $P(\bar{p}_1 - \bar{p}_2 > .03)$. The event
$\bar{p}_1 - \bar{p}_2 > .03$ is transformed to an equivalent event involving z by the use of formula (10.23). The
equivalent event is

$$\frac{\bar{p}_1 - \bar{p}_2 - .01}{.00987} > \frac{.03 - .01}{.00987} = 2.03, \text{ or } z > 2.03$$

The event $\bar{p}_1 - \bar{p}_2 > .03$ is equivalent to the event z > 2.03, and
therefore $P(\bar{p}_1 - \bar{p}_2 > .03)$ = P(z > 2.03) = .5 - .4788 = .0212. There are only approximately 2 chances out of
100 that the percent defective in sample 1 will exceed the percent defective in sample 2 by 3% or more.
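
The probability in Example 10.18 can be approximated without a normal table. The sketch below assumes Python's standard-library NormalDist class as a stand-in for the table lookup; because it does not round z to two decimals first, its answer differs slightly from the table-based .0212.

```python
import math
from statistics import NormalDist

# Population proportions and sample sizes from Example 10.18
p1, p2 = 0.03, 0.02
n1, n2 = 500, 500

mean_diff = p1 - p2                                        # formula (10.21)
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # formula (10.22)
z = (0.03 - mean_diff) / se                                # formula (10.23)
prob = 1 - NormalDist().cdf(z)                             # P(z > 2.03), upper-tail area

print(round(se, 5), round(z, 2), round(prob, 4))
```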


ESTIMATION OF $p_1 - p_2$ USING LARGE INDEPENDENT SAMPLES


The difference in sample proportions, $\bar{p}_1 - \bar{p}_2$, is a point estimate of $p_1 - p_2$. An interval estimate of
$p_1 - p_2$ is obtained by using formula (10.23). However, the standard error as given in formula (10.22)
will need to be estimated since $p_1$ and $p_2$ are unknown. If the population proportions are estimated by
their corresponding sample proportions, we obtain the estimated standard error of $\bar{p}_1 - \bar{p}_2$. The
estimated standard error of $\bar{p}_1 - \bar{p}_2$ is represented by $S_{\bar{p}_1 - \bar{p}_2}$ and is given by formula (10.24).

$$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}} \qquad (10.24)$$

The confidence interval for $p_1 - p_2$ is given by formula (10.25):

$$(\bar{p}_1 - \bar{p}_2) \pm z \times S_{\bar{p}_1 - \bar{p}_2} \qquad (10.25)$$

The confidence interval is valid provided that $n_1 p_1 > 5$, $n_2 p_2 > 5$, $n_1 q_1 > 5$, and $n_2 q_2 > 5$. Since $p_1$, $q_1$,
$p_2$, and $q_2$ are unknown, the corresponding sample quantities are substituted to check the validity of
using the confidence interval.


EXAMPLE 10.19 A survey of 2000 ninth-graders found that 32% had used cigarettes in the past week and a
survey of 1500 high school seniors found that 35% had used cigarettes in the past week. Suppose $p_1$ represents
the proportion of all ninth-graders who used cigarettes in the past week and $p_2$ represents the proportion of all
seniors who used cigarettes in the past week. A point estimate for $p_1 - p_2$ is -3%. The estimated standard error
of $\bar{p}_1 - \bar{p}_2$, working in percents, is as follows:

$$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}} = \sqrt{\frac{32 \times 68}{2000} + \frac{35 \times 65}{1500}} = 1.614\%$$

A 95% confidence interval for $p_1 - p_2$ is given by $(\bar{p}_1 - \bar{p}_2) \pm z \times S_{\bar{p}_1 - \bar{p}_2}$, or
-3% $\pm$ 1.96 x 1.614%, or -3% $\pm$ 3.2%. The 95% confidence interval extends from -6.2% to 0.2%. Note that
$n_1 \bar{p}_1$ = 2000 x .32 = 640, $n_1 \bar{q}_1$ = 2000 x .68 = 1360, $n_2 \bar{p}_2$ = 1500 x .35 = 525, and
$n_2 \bar{q}_2$ = 1500 x .65 = 975. Since all 4 of these
quantities exceed 5, the confidence interval is valid.
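
The interval of Example 10.19 can be reproduced with a small helper. The function name diff_proportion_ci is hypothetical; the sketch applies formulas (10.24) and (10.25) to proportions expressed as decimals and takes z = 1.96 from the standard normal table.

```python
import math

def diff_proportion_ci(p1_hat, n1, p2_hat, n2, z):
    """Large-sample confidence interval for p1 - p2 (hypothetical helper)."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)  # (10.24)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se, se                                   # (10.25)

# Example 10.19: ninth-graders (sample 1) and seniors (sample 2)
lower, upper, se = diff_proportion_ci(0.32, 2000, 0.35, 1500, z=1.96)

# Validity check: all four sample counts should exceed 5
counts = [2000 * 0.32, 2000 * 0.68, 1500 * 0.35, 1500 * 0.65]
valid = all(c > 5 for c in counts)
print(round(se, 5), round(lower, 3), round(upper, 3), valid)
```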


TESTING HYPOTHESIS ABOUT $p_1 - p_2$ USING LARGE INDEPENDENT
SAMPLES


The most common null hypothesis concerning $p_1 - p_2$ is $H_0: p_1 - p_2 = 0$. Recall that in testing
hypothesis, the null hypothesis is always assumed to be true and the test statistic is computed under
this assumption. If the test statistic value is judged to be highly unusual, then the assumption that the
null hypothesis is true is rejected. When $H_0$ is assumed to be true, $p_1 - p_2 = 0$ or $p_1 = p_2$. If we let p be
the common value of $p_1$ and $p_2$, and let q = 1 - p, then the standard error of $\bar{p}_1 - \bar{p}_2$ simplifies as follows:

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Since p and q are unknown, they must be estimated from the two samples. Let $x_1$ be the number in
sample 1 with the characteristic of interest and let $x_2$ be the number in sample 2 with the
characteristic of interest. A pooled estimate of p is given by

$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} \qquad (10.26)$$

Substituting $\bar{p}$ for p and $\bar{q} = 1 - \bar{p}$ for q in the above expression for $\sigma_{\bar{p}_1 - \bar{p}_2}$, we obtain the
estimated standard error of $\bar{p}_1 - \bar{p}_2$ as given in

$$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \qquad (10.27)$$

The test statistic for testing the null hypothesis $H_0: p_1 - p_2 = 0$ is obtained by using formula
(10.23) with $p_1 - p_2$ replaced by 0 and $\sigma_{\bar{p}_1 - \bar{p}_2}$ estimated by $S_{\bar{p}_1 - \bar{p}_2}$ as given in formula (10.27). The
resulting test statistic is given in formula (10.28):

$$z = \frac{\bar{p}_1 - \bar{p}_2}{S_{\bar{p}_1 - \bar{p}_2}} \qquad (10.28)$$


The steps for testing a hypothesis concerning $p_1 - p_2$ are given in Table 10.16.


Table 10.16 


Steps for Testing a Hypothesis Concerning $p_1 - p_2$: Large Independent Samples
$n_1 p_1 > 5$, $n_2 p_2 > 5$, $n_1 q_1 > 5$, and $n_2 q_2 > 5$


Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by
$H_0: p_1 - p_2 = 0$ and the research hypothesis is of the form $H_a: p_1 - p_2 < 0$ or $H_a: p_1 - p_2 > 0$ or
$H_a: p_1 - p_2 \ne 0$.


Step 2: Use the standard normal distribution table and the level of significance, $\alpha$, to determine the rejection
region.


Step 3: Compute the value of the test statistic as follows:

$$z^* = \frac{\bar{p}_1 - \bar{p}_2}{S_{\bar{p}_1 - \bar{p}_2}}$$

where $\bar{p}_1 - \bar{p}_2$ is the computed difference in sample proportions,
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$, and $\bar{q} = 1 - \bar{p}$.


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls 
in the rejection region. Otherwise, the null hypothesis is not rejected. 


EXAMPLE 10.20 A study was conducted to compare teen cigarette use for whites and Hispanics. Suppose $p_1$
represents the proportion of teenage whites who use cigarettes and $p_2$ represents the proportion of teenage
Hispanics who use cigarettes. The null hypothesis is $H_0: p_1 - p_2 = 0$ and the research hypothesis is
$H_a: p_1 - p_2 \ne 0$. The critical values for a level of significance equal to .05 are $\pm 1.96$. The sample results are
given in Table 10.17.


Table 10.17 


Sample          Sample size      Number of smokers    Sample proportion
1. Whites       $n_1$ = 1500     $x_1$ = 555          $\bar{p}_1$ = 0.37
2. Hispanics    $n_2$ = 500      $x_2$ = 175          $\bar{p}_2$ = 0.35


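The pooled estimate of the common proportion is $\bar{p} = \frac{555 + 175}{1500 + 500} = .365$, and $\bar{q}$ = 1 - .365 = .635.
The estimated standard error of the difference in sample proportions is

$$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{.365 \times .635 \left(\frac{1}{1500} + \frac{1}{500}\right)} = 0.02486$$

and the computed test statistic is:

$$z^* = \frac{\bar{p}_1 - \bar{p}_2}{S_{\bar{p}_1 - \bar{p}_2}} = \frac{.37 - .35}{.02486} = 0.80$$

Since z* = 0.80 falls between the critical values -1.96 and 1.96, the null hypothesis is not rejected.
Based on this study, we would not be able to conclude that a difference exists between the proportion of
smokers within the two groups of teenagers.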
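
The pooled test of Table 10.16 can be sketched in Python using the counts in Table 10.17. The function name pooled_z is hypothetical, and the critical value 1.96 is taken from the standard normal table.

```python
import math

def pooled_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 - p2 = 0 (hypothetical helper)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                               # formula (10.26)
    q_pool = 1 - p_pool
    se = math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))          # formula (10.27)
    return p_pool, se, (p1_hat - p2_hat) / se                    # formula (10.28)

# Example 10.20: whites (sample 1) and Hispanics (sample 2)
p_pool, se, z_star = pooled_z(555, 1500, 175, 500)
reject = abs(z_star) > 1.96        # two-tailed test at alpha = .05
print(round(p_pool, 3), round(se, 5), round(z_star, 2), reject)
```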
Solved Problems


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR LARGE INDEPENDENT SAMPLES


10.1 A sample of size 50 is taken from a population having a mean equal to 90 and a standard
deviation equal to 15. A second sample of size 70 and independent of the first sample is
selected from another population having mean equal to 75 and standard deviation equal to 10.
The mean of the sample of size 50 is represented by $\bar{x}_1$ and the mean of the sample of size 70 is
represented as $\bar{x}_2$.

(a) What type of distribution does $\bar{x}_1 - \bar{x}_2$ have?
(b) What is the expected value of $\bar{x}_1 - \bar{x}_2$?
(c) What is the standard error of $\bar{x}_1 - \bar{x}_2$?
(d) Transform $\bar{x}_1 - \bar{x}_2$ to a standard normal.


Ans. (a) Because both sample sizes are 30 or more, $\bar{x}_1 - \bar{x}_2$ will have a normal distribution.
(b) The expected value of $\bar{x}_1 - \bar{x}_2$ is $E(\bar{x}_1 - \bar{x}_2) = \mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ = 90 - 75 = 15.
(c) The standard error of $\bar{x}_1 - \bar{x}_2$ is
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{225}{50} + \frac{100}{70}} = 2.43$.
(d) $z = \frac{\bar{x}_1 - \bar{x}_2 - 15}{2.43}$ is a standard normal variable.


ESTIMATION OF $\mu_1 - \mu_2$ USING LARGE INDEPENDENT SAMPLES


10.2 Table 10.18 gives the summary statistics for the number of years that 250 men and 250 women
have spent with their current employers. Use these results to find a 90% confidence interval for
$\mu_1 - \mu_2$, the mean difference in years spent with their current employer for men and women.


Table 10.18

Sample       Sample size    Mean         Standard deviation
1. Men           250        5.5 years    2.1 years
2. Women         250        3.3 years    1.8 years


Ans. The general form of the confidence interval for $\mu_1 - \mu_2$ is $(\bar{x}_1 - \bar{x}_2) \pm z \times \sigma_{\bar{x}_1 - \bar{x}_2}$. The
difference in means is $\bar{x}_1 - \bar{x}_2$ = 5.5 - 3.3 = 2.2. The z value for a 90% confidence interval is 1.65. The
estimated standard error of the difference in means is
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{4.41}{250} + \frac{3.24}{250}} = .1749$.
The 90% margin of error when using $\bar{x}_1 - \bar{x}_2$ as an estimate of $\mu_1 - \mu_2$ is 1.65 x .1749 or .2886.
The confidence interval is 2.2 $\pm$ .3. The confidence interval extends from 1.9 to 2.5 years.


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING LARGE INDEPENDENT
SAMPLES


10.3 Use the data in Table 10.18 to test the null hypothesis that $\mu_1 - \mu_2$ = 1.5 years vs. the research
hypothesis that $\mu_1 - \mu_2$ > 1.5 years at level of significance $\alpha = .01$.

Ans. Since the samples are large, the standard normal distribution table is used to find the critical value.
From the standard normal table, we find that P(z > 2.33) = .01, and therefore the critical value is
2.33. The computed value of the test statistic is given as follows:

$$z^* = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sigma_{\bar{x}_1 - \bar{x}_2}} = \frac{5.5 - 3.3 - 1.5}{.1749} = 4.00$$

The standard error of the difference in means is replaced by the estimated standard error of the
difference in means. Since z* = 4.00 exceeds the critical value 2.33, it is concluded that the mean
difference exceeds 1.5 years.


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS


10.4 A sample of size 10 is taken from a normal population having a mean equal to 35 and a sample
of size 15 is taken from another normal population having mean equal to 40. The two normal
populations have equal variances. The mean and variance of the sample of size 10 are
represented by $\bar{x}_1$ and $s_1^2$ and the mean and variance of the sample of size 15 are represented
by $\bar{x}_2$ and $s_2^2$.

(a) Give the expression for the pooled estimate of the common population variance.

(b) Give the expression for the estimated standard error of the difference in the sample means.

(c) Use the results in parts (a) and (b) to form a statistic that has a t distribution with 23
degrees of freedom.


Ans. (a) $S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{9 \times s_1^2 + 14 \times s_2^2}{23}$

(b) $S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{S^2 \times \left(\frac{1}{10} + \frac{1}{15}\right)}$, where $S^2$ is given in part (a).

(c) $t = \frac{\bar{x}_1 - \bar{x}_2 - (35 - 40)}{S_{\bar{x}_1 - \bar{x}_2}} = \frac{\bar{x}_1 - \bar{x}_2 + 5}{S_{\bar{x}_1 - \bar{x}_2}}$, where $S_{\bar{x}_1 - \bar{x}_2}$ is given in part (b).


ESTIMATION OF $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS


10.5 A comparison of motel room rates for single occupancy was made for the cities of Omaha,
Nebraska and Kansas City, Missouri. The rates for the two cities are shown in Table 10.19.
Using the command %normplot of Minitab for both samples, it is found that it is reasonable
to assume that both populations are normally distributed. Using the command %vartest of
Minitab, it is also found that it is reasonable to assume that the populations have equal
variability. The Minitab output for setting a 99% confidence interval on $\mu_1 - \mu_2$ is given below.
Verify the pooled standard deviation, the degrees of freedom, and the 99% confidence interval
given in the output.

Table 10.19


MTB > twot 99% confidence data in c2, groups in c1;
SUBC > pooled.


Two Sample T-Test and Confidence Interval
Two sample T for rate

city     N     Mean    StDev    SE Mean
1       11     75.5     11.3        3.4
2       11     81.4     14.3        4.3

99% CI for mu (1) - mu (2): (-21.6, 9.7)
T-Test mu (1) = mu (2) (vs not =): T = -1.07 P = 0.30 DF = 20
Both use Pooled StDev = 12.9


Ans. The difference in sample means is $\bar{x}_1 - \bar{x}_2$ = 75.5 - 81.4 = -5.9.

The pooled variance is
$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{10 \times 127.69 + 10 \times 204.49}{11 + 11 - 2} = 166.09$,
and $S = \sqrt{166.09} = 12.9$.

The standard error of the difference in the sample means is
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{166.09 \times \left(\frac{1}{11} + \frac{1}{11}\right)} = 5.50$.

The degrees of freedom is df = $n_1 + n_2 - 2$ = 20.
Using the t distribution table with df = 20 and right-hand tail area equal to .005, we find the t value
is 2.845. The 99% margin of error is 2.845 x 5.50 = 15.6. The 99% confidence interval extends
from -5.9 - 15.6 = -21.5 to -5.9 + 15.6 = 9.7.
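
The verification requested in problem 10.5 can also be carried out with a few lines of Python. The sketch assumes only the rounded summary statistics printed in the Minitab output and the table value 2.845; because Minitab works from the unrounded data, its lower limit (-21.6) differs from the hand result in the last digit.

```python
import math

# Rounded summary statistics from the Minitab output of problem 10.5
n1, mean1, s1 = 11, 75.5, 11.3
n2, mean2, s2 = 11, 81.4, 14.3

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df    # pooled variance S^2
pooled_sd = math.sqrt(pooled_var)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))             # standard error of the difference

t_value = 2.845                                            # table value: df = 20, area .005
diff = mean1 - mean2
lower, upper = diff - t_value * se, diff + t_value * se
print(df, round(pooled_sd, 1), round(se, 2), round(lower, 1), round(upper, 1))
```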


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS


10.6 Use the motel rate data in Table 10.19 and the Minitab output given in problem 10.5 to test the
null hypothesis $H_0: \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 - \mu_2 \ne 0$ at significance level $\alpha = .01$.

(a) Give the critical values for performing the test.

(b) Give the computed value of the test statistic, and your conclusion based upon this value
and the critical value in part (a).

(c) Give the p value and your conclusion based on this value.

Ans.

(a) The degrees of freedom for the t distribution is 20. Since the research hypothesis is two-tailed,
the significance level is divided by 2 and .005 is put into each tail of the distribution. By
consulting the t distribution table, we find that for df = 20 and right-tail area = .005, the t
value is 2.845. The critical values are $\pm 2.845$.

(b) The computed t value, from the Minitab output in problem 10.5, is t* = -1.07. Since this value
does not fall in the rejection region, the null hypothesis is not rejected. It cannot be concluded
that the mean motel rates differ for the two cities.

(c) The p value, from the Minitab output in problem 10.5, is equal to 0.30. Since this exceeds the
preset level of significance, the null hypothesis is not rejected. The same conclusion is reached
as in part (b).


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL
(AND UNKNOWN) STANDARD DEVIATIONS


10.7 Refer to problem 10.4. Suppose the two populations have unequal population variances. What
changes are needed to the answers given in the problem?


Ans. Since the population variances are unequal, the sample variances are not pooled together to
estimate a common population variance. The standard error of the difference in the sample means
is given by $S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{s_1^2}{10} + \frac{s_2^2}{15}}$. The statistic
$t = \frac{\bar{x}_1 - \bar{x}_2 + 5}{S_{\bar{x}_1 - \bar{x}_2}}$ has a t distribution and
the degrees of freedom is given by df = minimum of {$(n_1 - 1)$, $(n_2 - 1)$} = minimum of {9, 14} =
9. Note that the degrees of freedom is reduced from 23 to 9. This problem illustrates the
importance of checking the assumptions underlying the estimation and testing procedures. The
computation of the test statistic as well as the degrees of freedom is determined by the assumption
concerning the variances of the two populations.


ESTIMATION OF $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN)
STANDARD DEVIATIONS


10.8 Refer to problem 10.5. Compute the standard error of the difference in means, the degrees of
freedom, and the 99% confidence interval assuming unequal population variances. Compare
the results with those in problem 10.5.


Ans. The standard error of the difference in sample means is
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{11.3^2}{11} + \frac{14.3^2}{11}}$,
which equals 5.50. This is the same answer obtained in the equal variances case. This will always
occur if the sample sizes are equal.

The degrees of freedom is df = minimum of {$(n_1 - 1)$, $(n_2 - 1)$} = minimum of {10, 10} = 10.
Using the t distribution table with df = 10 and right-hand tail area equal to .005, we find the t value
is 3.169. The 99% margin of error is 3.169 x 5.50 = 17.4. The 99% confidence interval extends
from -5.9 - 17.4 = -23.3 to -5.9 + 17.4 = 11.5.

The 99% confidence interval in the equal variances case is (-21.5, 9.7). The 99% confidence
interval for the unequal variances case is (-23.3, 11.5). Note that the interval in the unequal
variances case is wider.


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL 
(AND UNKNOWN) STANDARD DEVIATIONS 


10.9 In a study of internet users, the average time spent online per week was determined for a group
of college graduates as well as a group of non-college graduates. The results of the study are
shown in Table 10.20. Test the research hypothesis that $\mu_1 > \mu_2$ at level of significance $\alpha$ = .05.
Give the critical value, the computed test statistic, and your conclusion. Assume that the times
are normally distributed for both populations.


Table 10.20 


| Sample | Sample size | Mean | Standard deviation |
| 1. College graduate | 14 | 8.6 hours | 1.1 hours |
| 2. Non-college graduate | 12 | 6.3 hours | 2.7 hours |


Ans. Some statisticians use the following rule to decide whether population variances are equal or not:
If $.5 \le \frac{S_1}{S_2} \le 2$, then assume $\sigma_1 = \sigma_2$. Otherwise, assume that $\sigma_1 \ne \sigma_2$. Since the ratio of the sample
standard deviations, 1.1/2.7 = .41,
is less than .5, we assume that the populations have unequal standard deviations. The degrees of
freedom for this model is df = minimum of {($n_1$ - 1), ($n_2$ - 1)} = minimum of {13, 11} = 11. The
critical value is determined by using the t distribution table with df = 11 and right-tail area equal to
.05. This value is found to equal 1.796.


The standard error of the difference in sample means is
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1.21}{14} + \frac{7.29}{12}} = .833$.
The computed test statistic is $t^* = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S_{\bar{x}_1 - \bar{x}_2}} = \frac{8.6 - 6.3 - 0}{.833} = 2.76$. The research
hypothesis is supported since t* = 2.76 exceeds the critical value, 1.796.
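As a check on problem 10.9, the following Python sketch applies the ratio rule, then computes the unpooled standard error and the test statistic. The names are illustrative; the critical value 1.796 is the table value quoted in the answer.

```python
import math

# Summary statistics from Table 10.20.
n1, mean1, s1 = 14, 8.6, 1.1
n2, mean2, s2 = 12, 6.3, 2.7

# Rule of thumb from the answer: treat the standard deviations as equal
# only if the ratio of sample standard deviations is between .5 and 2.
ratio = s1 / s2
assume_unequal = not (0.5 <= ratio <= 2)

# Unpooled standard error and conservative degrees of freedom.
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
df = min(n1 - 1, n2 - 1)

t_star = (mean1 - mean2 - 0) / se
critical = 1.796  # t table, df = 11, right-tail area .05
supported = t_star > critical
print(round(ratio, 2), df, round(se, 3), round(t_star, 2), supported)
```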


SAMPLING DISTRIBUTION OF $\bar{d}$ FOR NORMALLY DISTRIBUTED
DIFFERENCES COMPUTED FOR DEPENDENT SAMPLES 


10.10 Table 10.21 gives a set of paired data, along with the differences for the pairs. Answer the 
following questions concerning these paired data. 


(a) Find the following: $\bar{d} = \frac{\sum d}{n}$, $S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}}$, and $S_{\bar{d}} = \frac{S_d}{\sqrt{n}}$.


(b) What parameters do each of the statistics in part (a) estimate?


(c) What assumption is needed in order that the statistic $t = \frac{\bar{d} - \mu_d}{S_{\bar{d}}}$ have a t distribution
with (n - 1) = (6 - 1) = 5 degrees of freedom?


Table 10.21 


Sample | Sample 2 
15 


] 18 

2 22 
3 25 
4 22 
5 19 
6 22 


Ans. (a) $\sum d$ = 3 + 1 + 2 - 2 + 0 + 2 = 6, $\sum d^2$ = 9 + 1 + 4 + 4 + 0 + 4 = 22, $\bar{d} = \frac{6}{6} = 1$,
$S_d = \sqrt{\frac{22 - 36/6}{5}} = 1.789$, and $S_{\bar{d}} = \frac{1.789}{\sqrt{6}} = .730$.


(b) The statistic $\bar{d}$ estimates $\mu_d$, the mean of the population of paired differences. The statistic
$S_d$ estimates $\sigma_d$, the standard deviation of the population of paired differences. The statistic
$S_{\bar{d}}$ estimates $\sigma_{\bar{d}}$, the standard error of $\bar{d}$.


(c) It is assumed that the population of paired differences is normally distributed.
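The three statistics in part (a) can be reproduced with Python's standard library. This is only a sketch; the list of differences is the one used in the answer to part (a).

```python
import math
import statistics

# Paired differences from the answer to problem 10.10(a).
d = [3, 1, 2, -2, 0, 2]
n = len(d)

d_bar = statistics.mean(d)      # mean of the differences
s_d = statistics.stdev(d)       # sample standard deviation (divisor n - 1)
se_d_bar = s_d / math.sqrt(n)   # estimated standard error of d_bar

print(d_bar, round(s_d, 3), round(se_d_bar, 3))
```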


ESTIMATION OF $\mu_d$ USING NORMALLY DISTRIBUTED DIFFERENCES
COMPUTED FROM DEPENDENT SAMPLES 


10.11 A sociological study compared the salaries of 10 professional African-American women with 
the salaries of 10 corresponding professional White-American women. The women were 
paired according to certain salient characteristics and the 10 pairs were chosen from ten 
different professions. The salaries (in thousands) are shown in Table 10.22. 


Table 10.22 
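| Pair | African-American | White | Difference |
| 1 | 65 | 60 | 5 |
| 2 | 50 | 55 | -5 |
| 3 | 75 | 70 | 5 |
| 4 | 80 | 75 | 5 |
| 5 | 105 | 95 | 10 |
| 6 | 90 | 100 | -10 |
| 7 | 65 | 70 | -5 |
| 8 | 60 | 50 | 10 |
| 9 | 115 | 105 | 10 |
| 10 | 80 | 90 | -10 |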

The Minitab command tinterval 90 percent confidence data in c3 is used to produce the 
following output. The differences are computed and put into column c3. The command 
requests a 90% confidence interval on the mean difference. 


MTB > let c3 = c1 - c2
MTB > print c1 - c3


Data Display
Row  Afamer  white  diff
1    65      60     5
2    50      55     -5
3    75      70     5
4    80      75     5
5    105     95     10
6    90      100    -10
7    65      70     -5
8    60      50     10
9    115     105    10
10   80      90     -10


MTB > tinterval 90 percent confidence data in c3

Variable  N   Mean  StDev  SEMean  90.0%CI
diff      10  1.50  8.18   2.59    (-3.24, 6.24)


Verify that $\bar{d}$ = 1.50, $S_d$ = 8.18, $S_{\bar{d}}$ = 2.59, and that the confidence interval is as given in the
output.


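Ans. Using the differences given above, we find that $\sum d$ = 15, and $\bar{d}$ = 1.5. We also find that $\sum d^2$ =
625, and $S_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} = \sqrt{\frac{625 - 225/10}{9}} = 8.18$. The estimated standard error of $\bar{d}$ is
$S_{\bar{d}} = \frac{S_d}{\sqrt{n}} = \frac{8.18}{\sqrt{10}} = 2.59$.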


The confidence interval is given by $\bar{d} \pm t \times S_{\bar{d}}$. The t value for a 90% confidence interval is
found as follows. The degrees of freedom is df = 10 - 1 = 9. For a 90% confidence interval 5%
is put into each tail of the t distribution. Using the t distribution table with 9 degrees of freedom
and a right-hand tail area equal to .05, we find the t value to be 1.833. The 90% margin of error
is $t \times S_{\bar{d}}$ = 1.833 × 2.59 = 4.75. The 90% confidence interval goes from 1.50 - 4.75 = -3.25 to
1.50 + 4.75 = 6.25.
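The hand computation can be compared with the Minitab output using a few lines of Python. This is a sketch with illustrative names; the t value 1.833 is the table value quoted above. Because the standard error is not rounded before multiplying, the limits agree with Minitab's (-3.24, 6.24) rather than the hand-rounded (-3.25, 6.25).

```python
import math
import statistics

# Differences (column diff) from the Minitab data display.
diffs = [5, -5, 5, 5, 10, -10, -5, 10, 10, -10]
n = len(diffs)

d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)
se = s_d / math.sqrt(n)

t_value = 1.833  # t table, df = 9, right-tail area .05
margin = t_value * se
ci = (d_bar - margin, d_bar + margin)
print(d_bar, round(s_d, 2), round(se, 2), round(ci[0], 2), round(ci[1], 2))
```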


TESTING HYPOTHESIS ABOUT $\mu_d$ USING NORMALLY DISTRIBUTED
DIFFERENCES COMPUTED FROM DEPENDENT SAMPLES 


10.12 Use the data in Table 10.22 to test the research hypothesis $\mu_d \ne 0$ at level of significance
$\alpha$ = .01.


Ans. The degrees of freedom is one less than the number of pairs, i.e., df = 9. Since the research 
hypothesis is two-tailed, tail areas equal to .005 are put into both tails of the t distribution. From 
the t distribution table the critical value is determined by using a right-tail area equal to .005. The 
critical value is equal to 3.250. 


From problem 10.11, the following are found: $\bar{d}$ = 1.50, $S_{\bar{d}}$ = 2.59, and $D_0$ = 0. The computed test
statistic is $t^* = \frac{\bar{d} - D_0}{S_{\bar{d}}} = \frac{1.50 - 0}{2.59} = 0.58$. Since this value falls between -3.250 and 3.250, the
null hypothesis is not rejected.


SAMPLING DISTRIBUTION OF $\bar{p}_1 - \bar{p}_2$ FOR LARGE INDEPENDENT SAMPLES


10.13 Population 1 has $p_1$ = .010 and population 2 has $p_2$ = .005. Independent samples of size 5000
each are selected from both populations. Find the probability that $\bar{p}_1 - \bar{p}_2$ exceeds .015.


Ans. The mean value of $\bar{p}_1 - \bar{p}_2$ is .010 - .005 = .005. The standard error of $\bar{p}_1 - \bar{p}_2$ is as follows:

$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{.010 \times .990}{5000} + \frac{.005 \times .995}{5000}} = .001725$

Since $n_1 p_1 > 5$, $n_2 p_2 > 5$, $n_1 q_1 > 5$, and $n_2 q_2 > 5$, $\bar{p}_1 - \bar{p}_2$ has a normal distribution. We are asked to
find $P(\bar{p}_1 - \bar{p}_2 > .015)$. The transformation
$z = \frac{\bar{p}_1 - \bar{p}_2 - (p_1 - p_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}} = \frac{\bar{p}_1 - \bar{p}_2 - .005}{.001725}$ is used to
transform the statistic $\bar{p}_1 - \bar{p}_2$ to a standard normal variable. The same transformation on .015
gives the value $\frac{.015 - .005}{.001725} = 5.80$. We have the following: $P(\bar{p}_1 - \bar{p}_2 > .015) = P(z > 5.80)$,
which is approximately 0. It is highly unlikely that $\bar{p}_1 - \bar{p}_2$ exceeds .015.
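To see just how small this probability is, the calculation can be sketched in Python. The upper-tail normal area is obtained from the complementary error function in the standard library rather than from a table; the variable names are illustrative.

```python
import math

p1, p2 = 0.010, 0.005
n1 = n2 = 5000

mean_diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Standardize the cutoff .015.
z = (0.015 - mean_diff) / se

# Right-tail area of the standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
tail = 0.5 * math.erfc(z / math.sqrt(2))
print(round(se, 6), round(z, 2), tail)
```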


ESTIMATION OF $p_1 - p_2$ USING LARGE SAMPLES


10.14 Three thousand commuters in both New York and Chicago were surveyed and the percentage
of commuters who took more than 60 minutes to get to work was determined for both
groups. It was found that 16.5% in New York and 10.7% in Chicago required more than 60
minutes. Set a 90% confidence interval on $p_1 - p_2$, where $p_1$ corresponds to New York.


Ans. The confidence interval is given by $(\bar{p}_1 - \bar{p}_2) \pm z \times S_{\bar{p}_1 - \bar{p}_2}$, where $S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}}$.


The z value for 90% confidence is 1.65. The difference in sample percentages is 16.5% - 10.7% =
5.8%. The standard error is $S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{16.5 \times 83.5}{3000} + \frac{10.7 \times 89.3}{3000}} = 0.88\%$. The 90% margin of
error is 1.65 × 0.88 = 1.5%. The 90% confidence interval extends from 5.8 - 1.5 = 4.3% to 5.8 +
1.5 = 7.3%.
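The interval in problem 10.14 can be reproduced as follows. This is a sketch that works with proportions rather than percentages and uses the z value 1.65 quoted in the answer; the names are illustrative.

```python
import math

# Sample proportions and sizes (New York is sample 1, Chicago is sample 2).
p1_hat, p2_hat = 0.165, 0.107
n1 = n2 = 3000
z_value = 1.65  # z value for 90% confidence, as used in the answer

diff = p1_hat - p2_hat
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
margin = z_value * se
ci = (diff - margin, diff + margin)

# Report in percent to match the answer.
print(round(100 * se, 2), round(100 * ci[0], 1), round(100 * ci[1], 1))
```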


TESTING HYPOTHESIS ABOUT $p_1 - p_2$ USING LARGE INDEPENDENT
SAMPLES 


10.15 A survey of 50 men and 50 women was conducted and it was found that 16 of the men and 10
of the women used hotel room minibars. Test the hypothesis that $p_1 = p_2$ vs. $p_1 \ne p_2$ at level of
significance $\alpha$ = .05.

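Ans. The critical values are ±1.96. The pooled estimate of the common proportion is
$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{16 + 10}{50 + 50} = .26$. The standard error of the difference in proportions is
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p} \times \bar{q} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{.26 \times .74 \left(\frac{1}{50} + \frac{1}{50}\right)} = .0877$.
The difference in sample proportions is $\bar{p}_1 - \bar{p}_2$ = .32 - .20 = .12.
The computed test statistic is $z^* = \frac{.12 - 0}{.0877} = 1.37$. Since the computed test statistic does
not exceed the critical value, we cannot conclude that there is a difference between men and
women users of hotel room minibars.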
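A short Python sketch of the pooled test in problem 10.15 follows. The names are illustrative, and the critical value 1.96 is the one quoted in the answer.

```python
import math

x1, n1 = 16, 50   # men who used the minibar
x2, n2 = 10, 50   # women who used the minibar

# Pooled estimate of the common proportion under the null hypothesis.
p_pool = (x1 + x2) / (n1 + n2)
q_pool = 1 - p_pool

se = math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
z_star = (x1 / n1 - x2 / n2) / se

critical = 1.96  # two-tailed test at alpha = .05
reject = abs(z_star) > critical
print(round(p_pool, 2), round(se, 4), round(z_star, 2), reject)
```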


Supplementary Problems 


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR LARGE INDEPENDENT SAMPLES


10.16 A sample of size 100 is selected from a population having mean 75 and standard deviation 3. Another 
independent sample of size 100 is selected from a population having mean 50 and standard deviation 4. 
Verify that $\bar{x}_1 - \bar{x}_2$ has mean 25 and standard deviation 0.5.
(a) What percent of the time will $\bar{x}_1 - \bar{x}_2$ fall within 0.5 of 25?
(b) What percent of the time will $\bar{x}_1 - \bar{x}_2$ fall within 1.0 of 25?
(c) What percent of the time will $\bar{x}_1 - \bar{x}_2$ fall within 1.5 of 25?

Ans. (a) 68% (b) 95% (c) 99.7%


ESTIMATION OF $\mu_1 - \mu_2$ USING LARGE INDEPENDENT SAMPLES


10.17 The information shown in Table 10.23 was obtained from two independent samples selected from two 
populations. 
(a) Give a point estimate for $\mu_1 - \mu_2$.
(b) Find a 99% confidence interval for $\mu_1 - \mu_2$.


Table 10.23 


| Sample | Sample size | Mean | Standard deviation |
| 1 | 50 | $9,500 | $1,250 |
| 2 | 75 | $9,125 | $950 |


Ans. (a) $375 (b) 375 ± 2.58 × 208.05 or -$161.77 to $911.77


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING LARGE INDEPENDENT SAMPLES


10.18 Use the data given in problem 10.17 to test the hypothesis $H_0$: $\mu_1 - \mu_2 = 0$ vs. $H_a$: $\mu_1 - \mu_2 > 0$ at level of
significance $\alpha$ = .01.
(a) Give the computed test statistic. 
(b) Give the p value. 
(c) Give your conclusion. 


Ans. (a) $z^* = \frac{375 - 0}{208.05} = 1.80$ (b) p value = .5 - .4641 = .0359
(c) Do not reject $H_0$ since p value > $\alpha$.


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 


10.19 A psychological study compared the language skills and mental development of two groups of two-year- 
olds. One group consisted of chatty toddlers and the other group consisted of quiet children. The scores 
on a test which measured language skills are shown for the two groups in Table 10.24. Use Minitab to 
determine whether it is reasonable to assume the populations of test scores are normally distributed and 
also determine if it is reasonable to assume that two populations have equal standard deviations. 


Table 10.24 


Chatty toddlers Quiet toddlers 




Ans. The following normal probability plots were produced by Minitab. Beneath the graph, the mean,
standard deviation, and sample size are shown as well as a p value for the Anderson-Darling
normality test. The p value corresponds to a null hypothesis, which states that the sample data were
selected from a normally distributed population. This hypothesis is rejected if the p value is less
than $\alpha$ = .05. Otherwise, normality is usually assumed. In this case, it is safe to assume that both
samples were obtained from normally distributed populations since the p value for the chatty
sample is 0.433 and the p value for the quiet sample is 0.368.


Normal Probability Plot

[Figure: normal probability plot of the chatty scores. Horizontal axis: chatty (60 to 90); vertical axis: Probability (.001 to .999)]

Average: 75.4545   StDev: 11.2815   N: 11
Anderson-Darling Normality Test   A-Squared: 0.337   P Value: 0.433


Normal Probability Plot

[Figure: normal probability plot of the quiet scores. Horizontal axis: quiet (70 to 90); vertical axis: Probability (.001 to .999)]

Average: 79.5455   StDev: 8.50134   N: 11
Anderson-Darling Normality Test   A-Squared: 0.365   P Value: 0.368




The following Minitab output may be used to test for equal population standard deviations. The
null hypothesis of equal population standard deviations is rejected if the p value corresponding to
Bartlett's test is less than $\alpha$ = .05. In this case, the p value equals 0.386, and it is reasonable to
assume that $\sigma_1 = \sigma_2$.


MTB > %vartest c2 c1
Homogeneity of Variance


Response   score
Factors    sample
ConfLvl    95.0000


Bartlett's Test (normal distribution)
Test Statistic: 0.752
P value : 0.386


ESTIMATION OF $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES FROM NORMAL
POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 


10.20 Refer to the data in Table 10.24. Minitab was used to set a 90% confidence interval on $\mu_1 - \mu_2$. The
output is shown below. Using the output, give a 90% confidence interval for $\mu_1 - \mu_2$.


MTB > twot 90% confidence, data in c2, sample number in c1;
SUBC > pooled.


Two Sample T-Test and Confidence Interval
Two sample T for score
sample  N   Mean   StDev  SE Mean
1       11  75.50  11.30  3.4
2       11  79.55  8.50   2.6


90% CI for mu (1) - mu (2): (-11.4, 3.3)
T-Test mu (1) = mu (2) (vs not =): T = -0.96 P = 0.35 DF = 20
Both use Pooled St Dev = 9.99


Ans. The 90% interval extends from -11.4 to 3.3.


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 


10.21 Refer to problems 10.19 and 10.20. Suppose the research hypothesis is that the language skills scores 
differ for the two groups. Give the computed test statistic and the corresponding p value.


Ans. The computed test statistic is t* = -0.96 and the p value is 0.35.


SAMPLING DISTRIBUTION OF $\bar{x}_1 - \bar{x}_2$ FOR SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS 


10.22 Thirty individuals who suffered from insomnia were randomly divided into two groups of 15 each. One
group was put on an exercise program of 40 minutes per day for four days a week. The other group was
not put on the exercise program and served as the control group. After six weeks, the time taken to fall
asleep was measured for each individual in the study. The results are given in Table 10.25.


Table 10.25


Exercise group Control group 


The Minitab output for the test of equal standard deviations is shown below. Would you assume equal 
or unequal population standard deviations? 


MTB > %vartest c2 c1
Homogeneity of Variance


Response   C2
Factors    C1
ConfLvl    95.0000


Bartlett's Test (normal distribution)
Test Statistic: 23.039
P value : 0.000


Ans. The Minitab procedure is used to test $H_0$: $\sigma_1 = \sigma_2$ vs. $H_a$: $\sigma_1 \ne \sigma_2$. Since the p value = 0.000, the
null hypothesis should be rejected. Assume that the standard deviations are not equal for the two
groups.


ESTIMATION OF $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES FROM NORMAL
POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS 


10.23 Refer to the data in Table 10.25. The Minitab analysis for setting a 99% confidence interval on $\mu_1 - \mu_2$
is shown below. This analysis assumes unequal population standard deviations.


MTB > twot 99% data in c2 groups in c1


Two Sample T-Test and Confidence Interval


Two sample T for C2
C1  N   Mean   StDev  SE Mean
1   15  15.40  1.40   0.36
2   15  23.20  6.27   1.60


99% CI for mu (1) - mu (2): (-12.69, -2.9)
T-Test mu (1) = mu (2) (vs not =): T = -4.70 P = 0.0003 DF = 15

Rather than finding the degrees of freedom by using df = minimum of {$n_1$ - 1, $n_2$ - 1}, Minitab uses a
different formula. If the degrees of freedom are found using df = minimum of {14, 14} = 14, the t value
is 2.977. The t value, using df = 15, is 2.947. The confidence interval will be approximately the same for
either value. Using df = 14, find a 90% confidence interval for $\mu_1 - \mu_2$.


Ans. $(15.40 - 23.20) \pm 1.761 \times \sqrt{\frac{1.40^2}{15} + \frac{6.27^2}{15}}$
or -7.8 ± 2.9 or (-10.7, -4.9)
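The text does not name the "different formula" Minitab uses for the degrees of freedom. A common choice for unpooled two-sample t procedures is the Welch-Satterthwaite approximation, so the sketch below assumes that formula, truncated to an integer. Under that assumption it reproduces DF = 15 from the output above, and it also checks the 90% interval computed with df = 14.

```python
import math

# Summary statistics from the Minitab output in problem 10.23.
n1, mean1, s1 = 15, 15.40, 1.40
n2, mean2, s2 = 15, 23.20, 6.27

v1, v2 = s1**2 / n1, s2**2 / n2
se = math.sqrt(v1 + v2)

# Assumed formula: Welch-Satterthwaite approximate degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Conservative rule used in the text.
df_conservative = min(n1 - 1, n2 - 1)

# 90% interval with df = 14 (t table value 1.761, right-tail area .05).
margin = 1.761 * se
ci = (mean1 - mean2 - margin, mean1 - mean2 + margin)
print(int(df_welch), df_conservative, round(ci[0], 1), round(ci[1], 1))
```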


TESTING HYPOTHESIS ABOUT $\mu_1 - \mu_2$ USING SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS 


10.24 Refer to problems 10.22 and 10.23. Is there a difference in the time to go to sleep for the two groups? 


Ans. Yes, the computed test statistic is t* = -4.70, and the p value is 0.0003.


SAMPLING DISTRIBUTION OF $\bar{d}$ FOR NORMALLY DISTRIBUTED DIFFERENCES COMPUTED
FOR DEPENDENT SAMPLES 


10.25 Table 10.26 gives the diastolic blood pressure before treatment and six weeks after treatment is started
for 10 hypertensive patients. The statistic, $t = \frac{\bar{d} - \mu_d}{S_{\bar{d}}}$, has a t distribution with df = n - 1, provided the
differences have a normal distribution. The basic normality assumption needs to be verified before
setting a confidence interval on $\mu_d$ or testing a hypothesis concerning $\mu_d$. The normal probability plot in
Minitab can be used to test the following hypothesis: $H_0$: The differences are normally distributed vs.
$H_a$: The differences are not normally distributed. If the level of significance is set at the conventional
level $\alpha$ = .05, then the null hypothesis is rejected if the p value < $\alpha$. Using the Minitab
output shown below, what decision do you reach concerning the assumption of
normality for the differences?


Table 10.26 


| Patient | Before | After__| Difference _| 
90 


— 


Ans. The p value for the Anderson-Darling Normality test is 0.233. The null hypothesis is not rejected, 
and it is assumed that the differences are normally distributed. 

Normal Probability Plot

[Figure: normal probability plot of the differences. Horizontal axis: difference (10 to 20); vertical axis: Probability]

Average: 14.3   StDev: 3.94546   N: 10
Anderson-Darling Normality Test   A-Squared: 0.438   P Value: 0.233


ESTIMATION OF $\mu_d$ USING NORMALLY DISTRIBUTED DIFFERENCES COMPUTED
FROM DEPENDENT SAMPLES 


10.26 Verify the following Minitab output for a 90% confidence interval for the mean
difference in the blood pressures given in problem 10.25.


MTB > tinterval 90 percent confidence, data in c1

Confidence Intervals

Variable  N   Mean   StDev  SEMean  90.0% CI
diff      10  14.30  3.95   1.25    (12.01, 16.59)


Ans. $\sum d$ = 143, $\sum d^2$ = 2185, $\bar{d}$ = 14.3, $S_d$ = 3.9455, $S_{\bar{d}}$ = 1.2477, t = 1.833.
$\bar{d} \pm t \times S_{\bar{d}}$ is found to be (12.01, 16.59).
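Since the individual blood pressures are not needed once the sums are known, the answer can be checked from the two sums alone. The sketch below uses the shortcut formula for the standard deviation of the differences; the names are illustrative.

```python
import math

n = 10
sum_d = 143       # sum of the differences
sum_d_sq = 2185   # sum of the squared differences

d_bar = sum_d / n

# Shortcut formula: S_d = sqrt((sum d^2 - (sum d)^2 / n) / (n - 1)).
s_d = math.sqrt((sum_d_sq - sum_d**2 / n) / (n - 1))
se = s_d / math.sqrt(n)

t_value = 1.833  # t table, df = 9, right-tail area .05
margin = t_value * se
ci = (d_bar - margin, d_bar + margin)
print(d_bar, round(s_d, 4), round(se, 4), round(ci[0], 2), round(ci[1], 2))
```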


TESTING HYPOTHESIS ABOUT $\mu_d$ USING NORMALLY DISTRIBUTED DIFFERENCES
COMPUTED FROM DEPENDENT SAMPLES


10.27 Verify the computed test statistic in the Minitab output below for the differences in problem 10.25.

MTB > ttest mu = 0 data in c1


T-Test of the Mean 
Test of mu = 0.00 vs mu not = 0.00 


Variable N Mean StDev SE Mean T P 

diff 10 14.30 3.95 1.25 11.46 0.0000 

Ans. $t^* = \frac{\bar{d} - D_0}{S_{\bar{d}}} = \frac{14.30 - 0}{1.25} = 11.44$


SAMPLING DISTRIBUTION OF $\bar{p}_1 - \bar{p}_2$ FOR LARGE INDEPENDENT SAMPLES


10.28 In a study of 100 surgery patients, 50 were kept warm with blankets after surgery and 50 were kept cool.
Eight of the warm group developed wound infections and 14 of the cool group developed wound
infections. Let $p_1$ represent the proportion of all surgery patients kept warm after surgery who develop
wound infections and let $p_2$ represent the proportion for the cool group. The sample proportions for the
two groups are $\bar{p}_1$ = .16 and $\bar{p}_2$ = .28. The difference in sample proportions, $\bar{p}_1 - \bar{p}_2$, will have a normal
distribution provided that $n_1 p_1 > 5$, $n_2 p_2 > 5$, $n_1 q_1 > 5$, and $n_2 q_2 > 5$. Since the population proportions are
unknown, these conditions cannot be checked directly. In practice, the conditions are checked by
substituting the sample proportions for the population proportions. Use the sample proportions to check
the requirements for assuming normality.


Ans. $n_1 \bar{p}_1$ = 8, $n_2 \bar{p}_2$ = 14, $n_1 \bar{q}_1$ = 42, and $n_2 \bar{q}_2$ = 36. Since all four values exceed 5, $\bar{p}_1 - \bar{p}_2$ has a normal distribution.


ESTIMATION OF $p_1 - p_2$ USING LARGE SAMPLES


10.29 Refer to problem 10.28. Find a 90% confidence interval for $p_1 - p_2$.


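Ans. The 90% confidence interval is $(\bar{p}_1 - \bar{p}_2) \pm z \times S_{\bar{p}_1 - \bar{p}_2}$, where $S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}}$.
(.16 - .28) ± 1.65 × .0820 or -.12 ± .14 or (-.26, .02)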


TESTING HYPOTHESIS ABOUT $p_1 - p_2$ USING LARGE INDEPENDENT SAMPLES


10.30 Refer to problem 10.28. Test $H_0$: $p_1 - p_2 = 0$ vs. $H_a$: $p_1 - p_2 < 0$ at $\alpha$ = .05.


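Ans. The pooled estimate of the common proportion is $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = 0.22$. The standard error is
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p} \times \bar{q} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = .0828$. The computed test statistic is $z^* = \frac{\bar{p}_1 - \bar{p}_2}{S_{\bar{p}_1 - \bar{p}_2}} = -1.45$. The
critical value is -1.65. Do not reject the null hypothesis.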


Chapter 11 


Chi-square Procedures 


CHI-SQUARE DISTRIBUTION 


The Chi-square procedures discussed in this chapter utilize a distribution called the Chi-square
distribution. The symbol $\chi^2$ is often used rather than the term Chi-square. The Greek letter $\chi$ is
pronounced Chi. Like the t distribution, the shape of the $\chi^2$ distribution curve is determined by the
degrees of freedom (df) associated with the distribution. Figure 11-1 shows $\chi^2$ distributions for 5, 10,
and 15 degrees of freedom.


[Figure: $\chi^2$ distribution curves for 5, 10, and 15 degrees of freedom]

Fig. 11-1


Table 11.1 gives some of the basic properties of $\chi^2$ distribution curves.


Table 11.1 


1. The total area under a $\chi^2$ curve is equal to one.

2. A $\chi^2$ curve starts at 0 on the horizontal axis and extends indefinitely to the right,
approaching, but never touching, the horizontal axis.

3. A $\chi^2$ curve is always skewed to the right.

4. As the number of degrees of freedom becomes larger, the $\chi^2$ curves look more and
more like normal curves.

5. The mean of a $\chi^2$ distribution is df and the variance is 2df.

6. When the degrees of freedom is 3 or more, the peak of the $\chi^2$ curve occurs at df - 2.
This value is the mode of the distribution.


EXAMPLE 11.1 Table 11.2 gives the mean, mode, and standard deviation for each of the three $\chi^2$ curves
shown in Fig. 11-1. The mean, mode, and standard deviation are determined by using properties 5 and 6 from
Table 11.1.


Table 11.2 


| df | Mean | Mode | Standard deviation |
| 5 | 5 | 3 | 3.16 |
| 10 | 10 | 8 | 4.47 |
| 15 | 15 | 13 | 5.48 |



CHI-SQUARE TABLES 


The area in the right tail under the $\chi^2$ distribution curve for various degrees of freedom is given
in the Chi-square tables found in Appendix 4. Example 11.2 illustrates how to read this table.


EXAMPLE 11.2 Table 11.3 contains the row corresponding to df = 5 from the Chi-square distribution table. 
Figures 11-2 and 11-3 give Chi-square curves having df = 5. The shaded area to the right of 11.070 in Fig 11-2 
is .050. The shaded area to the right of 1.610 in Fig 11-3 is equal to .900. These areas and Chi-square values are 
shown in bold print in Table 11.3. 


Table 11.3 
Area in the right tail under the Chi-square distribution curve

| df | .995 | .990 | .975 | .950 | .900 | .100 | .050 | .025 | .010 | .005 |
| 5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 9.236 | 11.070 | 12.833 | 15.086 | 16.750 |


[Figure: Chi-square curve with df = 5; the area to the right of 11.070 is shaded]

Fig. 11-2


GOODNESS-OF-FIT TEST 


In many situations, each element of a population is assigned to one and only one of k categories 
or classes. Such a population is described by a multinomial probability distribution. Example 11.3 
describes such a population and illustrates the structure of the null and alternative hypotheses for a 
goodness-of-fit test. 


EXAMPLE 11.3 Consider the population of Americans who've dieted. A survey reported that 85% were most
likely to go off their diet on the weekend, 10% were most likely to go off on a weekday, and 5% didn't know.
This population is divided into three categories: category 1: most likely to go off their diet on the weekend,
category 2: most likely to go off their diet on a weekday, and category 3: did not know when they were most
likely to go off their diet. The multinomial probability distribution is: $p_1$ = .85, $p_2$ = .10, and $p_3$ = .05. The Delta
Health fitness club is interested in whether their members follow this same multinomial probability distribution.
A goodness-of-fit test is used to test the null hypothesis: $H_0$: $p_1$ = .85, $p_2$ = .10, and $p_3$ = .05 vs. the following
alternative hypothesis: $H_a$: The population proportions are not $p_1$ = .85, $p_2$ = .10, and $p_3$ = .05. The probabilities
$p_1$, $p_2$, and $p_3$ in the hypotheses statements represent the proportions for the categories as applied to the health
fitness club members. The next two sections will describe the steps for performing a goodness-of-fit test.


EXAMPLE 11.4 Table 11.4 gives the age distribution of part-time college students as determined five years
ago. If $p_1$, $p_2$, $p_3$, $p_4$, and $p_5$ represent the current percentages for the five groups, then the null hypothesis that the
current distribution is the same as five years ago is stated as follows: $H_0$: $p_1$ = .25, $p_2$ = .35, $p_3$ = .25, $p_4$ = .10,
and $p_5$ = .05. The research hypothesis is stated as: $H_a$: The current proportions are not as stated in the null
hypothesis. A goodness-of-fit test is used to test this hypothesis system.


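Table 11.4

| Age group | Percent |
| 18-24 | 25 |
| 25-34 | 35 |
| 35-44 | 25 |
| 45-54 | 10 |
| 55 or over | 5 |


OBSERVED AND EXPECTED FREQUENCIES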


The first step in performing a goodness-of-fit test is the selection of a sample of size n from the 
population and the determination of the observed frequencies for k classes. Recall that in testing 
hypothesis, the null hypothesis is assumed to be true, and is rejected if a highly unlikely value is 
obtained for the test statistic. The expected frequencies are computed assuming the null hypothesis to 
be true. 


EXAMPLE 11.5 In Example 11.3, 200 members of Delta Health fitness club were surveyed and it was found 
that 160 were most likely to go off the diet on the weekend, 22 were most likely to go off the diet on a weekday, 
and 18 did not know. If the null hypothesis is true and the club members follow the nationwide distribution, then 
the expected numbers in the three categories are: $np_1$ = 200 × .85 = 170, $np_2$ = 200 × .10 = 20, and $np_3$ =
200 × .05 = 10. The observed frequencies are 160, 22, and 18 and the expected frequencies are 170, 20, and 10.


EXAMPLE 11.6 In Example 11.4, 1500 part-time students were surveyed across the country. It was observed 
that 352 were in the age group 18-24, 501 were in the age group 25-34, 371 were in the age group 35-44, 126 
were in the age group 45-54, and the remainder were in the age group 55 or over. If the null hypothesis is true 
and the distribution is the same as it was five years ago, the expected numbers in the five categories are as 
follows: $np_1$ = 1500 × .25 = 375, $np_2$ = 1500 × .35 = 525, $np_3$ = 1500 × .25 = 375, $np_4$ = 1500 × .10 = 150, and
$np_5$ = 1500 × .05 = 75. The observed frequencies are 352, 501, 371, 126, and 150 and the expected frequencies
are 375, 525, 375, 150, and 75.
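The expected frequencies in Example 11.6 are simply n times each hypothesized proportion, and the count for the last age group is what remains of the 1500 students. A minimal Python sketch, with illustrative names:

```python
n = 1500
null_proportions = [0.25, 0.35, 0.25, 0.10, 0.05]

# Observed counts for the first four age groups; the 55-or-over
# group is the remainder of the sample.
observed = [352, 501, 371, 126]
observed.append(n - sum(observed))

# Expected counts under the null hypothesis: e = n * p for each category.
expected = [round(n * p, 6) for p in null_proportions]

# The Chi-square approximation used in this chapter needs every e >= 5.
all_expected_ok = all(e >= 5 for e in expected)
print(observed, expected, all_expected_ok)
```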


SAMPLING DISTRIBUTION OF THE GOODNESS-OF-FIT TEST STATISTIC 


The goodness-of-fit test statistic is given in formula (11.1), where o represents an observed
frequency and e represents an expected frequency, and the sum is over all k categories.


$\chi^2 = \sum \frac{(o - e)^2}{e}$    (11.1)


The test statistic given in formula (11.1) has a Chi-square distribution with df = k - 1, provided
all the expected frequencies are 5 or more. Some statisticians use a less restrictive requirement,
namely, that all expected frequencies are at least one and that at most 20% of the expected
frequencies are less than 5. We shall use the requirement that all expected frequencies are 5 or more.
This requirement means that a minimum sample size is needed to use this procedure. If the observed
and expected frequencies are close, then the computed value for $\chi^2$ will be close to zero since the
differences (o - e) will all be near zero. If the observed and expected values differ considerably, then
the computed value of $\chi^2$ will be large, supporting the research hypothesis. Since only large values of
$\chi^2$ indicate that the null hypothesis should be rejected and the research hypothesis supported, this is
always a one-tailed test. That is, the null hypothesis is rejected only for large values of the computed
test statistic. Table 11.5 summarizes the procedure for performing a goodness-of-fit test.


Table 11.5


Steps for Performing a Goodness-of-Fit Test 


Step 1: State the null and research hypothesis concerning the hypothesized distribution for the k categories. 


Step 2: Use the $\chi^2$ table and the level of significance, $\alpha$, to determine the rejection region.


Step 3: Compute the value of the test statistic as follows: $\chi^2 = \sum \frac{(o - e)^2}{e}$, where o represents the observed
frequencies and e represents the expected frequencies. Check to make sure that all expected frequencies are
5 or more.


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls 
in the rejection region. Otherwise, the null hypothesis is not rejected. 


EXAMPLE 11.7 Refer to Examples 11.3 and 11.5. The null hypothesis may be stated in either of two ways:
$H_0$: The distribution of categories for Delta Health fitness club members is the same as the national distribution
or $H_0$: $p_1$ = .85, $p_2$ = .10, and $p_3$ = .05. The research hypothesis may also be stated in either of two ways: $H_a$: The
distribution of categories for Delta Health fitness club members is not the same as the national distribution or
$H_a$: The population proportions are not $p_1$ = .85, $p_2$ = .10, and $p_3$ = .05. Table 11.6 illustrates the computation of
the test statistic. The computed value of the test statistic is $\chi^2$ = 7.188.


Table 11.6 


For level of significance a = .05, the critical value is 5.991. The row corresponding to df = 3 — 1 = 2 from 
the Chi-square table in Appendix 4 is shown in Table 11.7. 


Table 11.7 
Area in the right tail under the Chi-square distribution curve 


| df | .995 | .990 | .975 | .950 | .900 | .100 | .050 | .025 | .010 | .005 |
| 2 | 0.010 | 0.020 | 0.051 | 0.103 | 0.211 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597 |


Since the computed test statistic exceeds 5.991, the null hypothesis is rejected. There are almost twice as many 
in the Don’t Know category at Delta Health fitness club as would be expected if the distributions were the same. 
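The steps of Table 11.5 as applied in Example 11.7 can be sketched in Python. The function name `goodness_of_fit_statistic` is illustrative; the critical value 5.991 is the table value quoted above.

```python
def goodness_of_fit_statistic(observed, expected):
    """Compute the sum of (o - e)^2 / e over all categories, formula (11.1)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e < 5 for e in expected):
        raise ValueError("all expected frequencies should be 5 or more")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example 11.7: Delta Health fitness club members.
observed = [160, 22, 18]
expected = [170, 20, 10]

chi_sq = goodness_of_fit_statistic(observed, expected)
df = len(observed) - 1
critical = 5.991  # Chi-square table, df = 2, right-tail area .050
reject_null = chi_sq > critical
print(round(chi_sq, 3), df, reject_null)
```

The third category contributes 64/10 = 6.4 of the total, which is why the conclusion singles out the Don't Know group.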


EXAMPLE 11.8 Refer to Examples 11.4 and 11.6. The null hypothesis is $H_0$: $p_1$ = .25, $p_2$ = .35, $p_3$ = .25, $p_4$ =
.10, and $p_5$ = .05. The research hypothesis is stated as: $H_a$: The current proportions are not as stated in the null
hypothesis. The observed and expected frequencies are given in Example 11.6. Table 11.8 illustrates the
computation of the goodness-of-fit test statistic. The computed value of the test statistic is $\chi^2$ = 81.391.


Table 11.8 


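| Age category | o | e | o - e | (o - e)² | (o - e)²/e |
| 18-24 | 352 | 375 | -23 | 529 | 1.411 |
| 25-34 | 501 | 525 | -24 | 576 | 1.097 |
| 35-44 | 371 | 375 | -4 | 16 | 0.043 |
| 45-54 | 126 | 150 | -24 | 576 | 3.840 |
| 55 or over | 150 | 75 | 75 | 5625 | 75.000 |
| Sum | 1500 | 1500 | 0 | | 81.391 |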

The row corresponding to df = 5 - 1 = 4 from the Chi-square table in Appendix 4 is shown in Table 11.9. For
level of significance $\alpha$ = .01, the critical value is 13.277 and is shown in bold type.


Table 11.9 
| df | .995 | .990 | .975 | .950 | .900 | .100 | .050 | .025 | .010 | .005 |
| 4 | 0.207 | 0.297 | 0.484 | 0.711 | 1.064 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860 |


Since the computed test statistic exceeds 13.277, the null hypothesis is rejected. There appears to have been an 
increase in the 55 or over group from 5 years ago. 


EXAMPLE 11.9 The Minitab solutions to Examples 11.7 and 11.8 are shown in Figures 11-4 and 11-5. 


MTB > set the observed values in c1
DATA > 160 22 18
DATA > end
MTB > set the expected values in c2
DATA > 170 20 10
DATA > end
MTB > let k1 = sum((c1 - c2)**2/c2)
MTB > print k1

k1 7.18824

MTB > cdf k1 k2;
SUBC > Chisquare 2.
MTB > let k3 = 1 - k2
MTB > print k3

k3 0.0274849

Fig. 11-4


MTB > set the observed values in c1
DATA > 352 501 371 126 150
DATA > end
MTB > set the expected values in c2
DATA > 375 525 375 150 75
DATA > end
MTB > let k1 = sum((c1 - c2)**2/c2)
MTB > print k1

k1 81.3905

MTB > cdf k1 k2;
SUBC > Chisquare 4.
MTB > let k3 = 1 - k2
MTB > print k3

k3 0

Fig. 11-5


The upper part of both figures illustrates the computation of the test statistic for Examples 11.7 and 11.8. The 
computed values, 7.18824 and 81.3905, are the same as shown in Examples 11.7 and 11.8. The portion of the 
output shown in bold illustrates the computation of the p value for the two Examples. The p value is shown next 
to k3. The p value for Example 11.7 is 0.027 and the p value for Example 11.8 is 0. 
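The p values reported next to k3 can be checked without statistical software when the degrees of freedom are even, because the right-tail area of a Chi-square distribution then has a closed form: for df = 2m it equals $e^{-x/2}$ times the sum of $(x/2)^i / i!$ for i = 0, ..., m - 1. The sketch below relies on that identity, which covers both examples (df = 2 and df = 4); the function name is illustrative, and the approach does not apply to odd degrees of freedom.

```python
import math

def chi_square_right_tail_even_df(x, df):
    """Right-tail area P(chi-square > x) for an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    term = 1.0    # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! from the previous term
        total += term
    return math.exp(-half) * total

p_example_11_7 = chi_square_right_tail_even_df(7.18824, 2)
p_example_11_8 = chi_square_right_tail_even_df(81.3905, 4)
print(round(p_example_11_7, 7), p_example_11_8)
```

The second p value is not exactly zero, but it is far below anything Minitab's display would show.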


CHI-SQUARE INDEPENDENCE TEST 


Consider a survey of 100 males and 100 females concerning their opinion toward capital 
punishment. Tables 11.10 and 11.11 give two different sets of results. In Table 11.10, the
distribution of opinions is exactly the same for males and females. In this case, we say that the 
opinion concerning capital punishment is independent of the sex of the respondent. 


Table 11.10

| | Supports | Opposes | Undecided | Row total |
| Male | 70 | 20 | 10 | 100 |
| Female | 70 | 20 | 10 | 100 |
| Column total | 140 | 40 | 20 | 200 |

In Table 11.11, the distribution of opinions is clearly different for males and females. In this case, we
say that the opinion concerning capital punishment is dependent on the sex of the respondent.


Table 11.11
| | Supports | Opposes | Undecided | Row total |
| Male | 70 | 20 | 10 | 100 |
| Female | 40 | 50 | 10 | 100 |
| Column total | 110 | 70 | 20 | 200 |


Tables 11.10 and 11.11 are called contingency tables. In a Chi-square independence test, we are 
interested in using results such as those shown in these tables, to test for the independence of two 
characteristics on the elements of a population. In the above discussion, the two characteristics are 
sex of the individual and opinion of the individual concerning capital punishment. The conclusions 
regarding independence would be clear in the two tables given above. But suppose the results of the 
survey are not as clear cut as above. How do we decide from the results given in a contingency table 
if two characteristics are independent? Suppose the results of the survey were as shown in Table 
11.12. 


Table 11.12

| | Supports | Opposes | Undecided | Row total |
| Male | 80 | 15 | 5 | 100 |
| Female | 70 | 25 | 5 | 100 |
| Column total | 150 | 40 | 10 | 200 |


In a Chi-square test of independence, the null hypothesis is that the two characteristics are
independent. The observed frequencies are shown in Table 11.12. A table of expected frequencies is
also determined assuming the null hypothesis to be true. In Table 11.12, note that 150 out of 200, or
75% of those surveyed, supported capital punishment. If the two characteristics are independent, we
would expect 75% of the 100 males and 75% of the 100 females to support capital punishment. From
this discussion, note that if \(e_{11}\) represents the expected frequency in the first row, first column cell,
then

\[ e_{11} = \frac{100 \times 150}{200} = 75 = \frac{(\text{row 1 total})(\text{column 1 total})}{\text{sample size}} \]

In general, the expected frequency in row i column j is given by formula (11.2):

\[ e_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{\text{sample size}} \qquad (11.2) \]


Table 11.13 shows the computation of the expected frequencies for Table 11.12. 


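Table 11.13

| | Supports | Opposes | Undecided | Row total |
| Male | 100 × 150/200 = 75 | 100 × 40/200 = 20 | 100 × 10/200 = 5 | 100 |
| Female | 100 × 150/200 = 75 | 100 × 40/200 = 20 | 100 × 10/200 = 5 | 100 |
| Column total | 150 | 40 | 10 | 200 |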
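
Formula (11.2) can be applied to every cell at once. The following is a minimal sketch in Python; the function name `expected_frequencies` is hypothetical and the observed counts are those of Table 11.12.

```python
def expected_frequencies(observed):
    """Apply formula (11.2): (row i total) x (column j total) / sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    sample_size = sum(row_totals)
    return [[r * c / sample_size for c in col_totals] for r in row_totals]

# Observed frequencies from Table 11.12 (rows: Male, Female)
table_11_12 = [[80, 15, 5],
               [70, 25, 5]]

expected = expected_frequencies(table_11_12)
for label, row in zip(["Male", "Female"], expected):
    print(label, row)
```

Each row of the result should reproduce the corresponding row of expected frequencies computed by hand.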
EXAMPLE 11.10 The table of expected frequencies for Table 11.10 is given in Table 11.14. Notice that when
the contingency table indicates independence, the observed and expected frequencies are exactly the same.


Table 11.14


| | Supports | Opposes | Undecided | Row total |
| Male | 100 × 140/200 = 70 | 100 × 40/200 = 20 | 100 × 20/200 = 10 | 100 |
| Female | 100 × 140/200 = 70 | 100 × 40/200 = 20 | 100 × 20/200 = 10 | 100 |


EXAMPLE 11.11 The table of expected frequencies for Table 11.11 is given in Table 11.15. Notice that when
the contingency table indicates strong dependence, the observed and expected frequencies are very
different.


Table 11.15


| | Supports | Opposes | Undecided | Row total |
| Male | 100 × 110/200 = 55 | 100 × 70/200 = 35 | 100 × 20/200 = 10 | 100 |
| Female | 100 × 110/200 = 55 | 100 × 70/200 = 35 | 100 × 20/200 = 10 | 100 |


How different do the observed frequencies and those you would expect when the characteristics are 
independent need to be before you would reject independence? The test statistic given in the next 
section will answer this question. 


SAMPLING DISTRIBUTION OF THE TEST STATISTIC FOR THE 
CHI-SQUARE INDEPENDENCE TEST 


Table 11.16 gives the observed frequencies and the expected frequencies in parentheses for the
data in Table 11.12. The null hypothesis is \(H_0\): Opinion concerning capital punishment is
independent of the sex of the individual. The research hypothesis is \(H_a\): Opinion concerning capital
punishment differs for males and females. The test statistic for testing this hypothesis is given by
formula (11.3), where o is an observed frequency, e is the corresponding expected frequency, and the
sum is over all 6 cells. If there are r rows and c columns, there will be r × c
cells in the contingency table. The test statistic has a Chi-square distribution with (r − 1) × (c − 1)
degrees of freedom.


\[ \chi^2 = \sum \frac{(o - e)^2}{e} \qquad (11.3) \]


Table 11.16


| | Supports | Opposes | Undecided | Row total |
| Male | 80 (75) | 15 (20) | 5 (5) | 100 |
| Female | 70 (75) | 25 (20) | 5 (5) | 100 |


The computed value of the test statistic is found as follows:


\[ \chi^{2*} = \sum \frac{(o - e)^2}{e} = \frac{(80-75)^2}{75} + \frac{(15-20)^2}{20} + \frac{(5-5)^2}{5} + \frac{(70-75)^2}{75} + \frac{(25-20)^2}{20} + \frac{(5-5)^2}{5} = 3.167 \]
The degrees of freedom is equal to df = (r − 1) × (c − 1) = (2 − 1) × (3 − 1) = 2. The critical value
from the Chi-square distribution table is 5.991 for \(\alpha = .05\). Since the computed test statistic does not
exceed the critical value, the null hypothesis is not rejected. We cannot reject the hypothesis that the
characteristics are independent. A Minitab analysis of the same data is shown below. The data are read first. The
command Chisquare c1 - c3 produces the expected frequencies, the computed value of the test
statistic, and the p value. The p value is equal to 0.205.


MTB > read c1 - c3
DATA > 80 15 5
DATA > 70 25 5
DATA > end
2 rows read.
MTB > Chisquare c1 - c3
Expected counts are printed below observed counts

          C1      C2      C3   Total
    1     80      15       5     100
       75.00   20.00    5.00
    2     70      25       5     100
       75.00   20.00    5.00
Total    150      40      10     200

Chi-Sq = 0.333 + 1.250 + 0.000 + 0.333 + 1.250 + 0.000 = 3.167
DF = 2, P value = 0.205
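
As a cross-check on the Minitab output, the same test can be sketched in Python. This is an illustrative sketch with hypothetical function names; the p value relies on the fact that for df = 2 the upper-tail area of the Chi-square distribution is exp(−x/2), so it does not generalize to other degrees of freedom.

```python
import math

def independence_statistic(observed):
    """Return (statistic, df) for the Chi-square test of independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # formula (11.2)
            statistic += (o - e) ** 2 / e          # formula (11.3)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

stat, df = independence_statistic([[80, 15, 5], [70, 25, 5]])
p_value = math.exp(-stat / 2)  # valid only because df = 2 here

print(f"Chi-Sq = {stat:.3f}, DF = {df}, P value = {p_value:.3f}")
```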


EXAMPLE 11.12 The Minitab output for the data in Tables 11.10 and 11.11 is shown in Figures 11-6 and
11-7.


MTB > read c1 - c3
DATA > 70 20 10
DATA > 70 20 10
DATA > end
2 rows read.
MTB > Chisquare c1 - c3
Expected counts are printed below observed counts

          C1      C2      C3   Total
    1     70      20      10     100
       70.00   20.00   10.00
    2     70      20      10     100
       70.00   20.00   10.00
Total    140      40      20     200

Chi-Sq = 0.000 + 0.000 + 0.000 + 0.000 + 0.000 + 0.000 = 0.000
DF = 2, P value = 1.000

Fig. 11-6


MTB > read c1 - c3
DATA > 70 20 10
DATA > 40 50 10
DATA > end
2 rows read.
MTB > Chisquare c1 - c3
Expected counts are printed below observed counts

          C1      C2      C3   Total
    1     70      20      10     100
       55.00   35.00   10.00
    2     40      50      10     100
       55.00   35.00   10.00
Total    110      70      20     200

Chi-Sq = 4.091 + 6.429 + 0.000 + 4.091 + 6.429 + 0.000 = 21.039
DF = 2, P value = 0.000

Fig. 11-7
SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE


The population and sample variances were defined and illustrated in Chapter 3. The population
variance is given by formula (11.4):

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad (11.4) \]

The sample variance is given by

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad (11.5) \]


The concept of the sampling distribution of the sample variance is established in Example 11.13. 


EXAMPLE 11.13 Consider the small finite population consisting of the times required for six individuals to
open a "child proof" aspirin bottle. The required times (in seconds) are shown in Table 11.17. The population
mean is \(\mu = 30\) seconds. The population variance is found as follows:


\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{400 + 100 + 0 + 0 + 100 + 400}{6} = 166.67 \]


The standard deviation is \(\sigma = \sqrt{166.67} = 12.9\).


Table 11.17 


Individual 


There are 20 different samples of size 3 possible and they are listed along with the sample variance and 
sampling error for each sample in Table 11.18. 


Table 11.18 
The distribution of the sample variance is obtained from Table 11.18 and is given in Table 11.19.


Table 11.19
| \(s^2\) | 33.29 | 100.00 | 133.40 | 233.48 | 400.00 | 432.64 |
| \(P(s^2)\) | .1 | .3 | .1 | .3 | .1 | .1 |
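
The enumeration behind Tables 11.18 and 11.19 can be sketched in Python. The six population values used below are an assumption: they are inferred from the squared deviations 400, 100, 0, 0, 100, 400 about the mean of 30, since the entries of Table 11.17 are not reproduced here. The sketch uses exact fractions, so its variances (for example 33.33) may differ slightly from tabled values that were rounded at an intermediate step.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Assumed population (seconds), inferred from the squared deviations about 30
population = [10, 20, 30, 30, 40, 50]

def sample_variance(values):
    """Sample variance with divisor n - 1, computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# All 20 samples of size 3; the two 30s are treated as distinct individuals
samples = list(combinations(population, 3))
counts = Counter(sample_variance(s) for s in samples)
distribution = {v: Fraction(c, len(samples)) for v, c in sorted(counts.items())}

for variance, prob in distribution.items():
    print(f"s^2 = {float(variance):7.2f}   P(s^2) = {float(prob):.2f}")
```

Even for this tiny population the listing has 20 rows, which illustrates why direct enumeration is impractical for populations of realistic size.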


The above procedure may be used to find the sampling distribution of the sample variance when
sampling from a finite population. However, it is clear that it is a tedious procedure even when using
a computer. When sampling from an infinite population, the result given in Table 11.20 is used to
determine the sampling distribution for a function of the sample variance, namely \((n-1)S^2/\sigma^2\). The
proof of this result is beyond the scope of this text. In the next section, this result is utilized to set a
confidence interval on \(\sigma^2\) as well as test hypotheses about \(\sigma^2\).


Table 11.20

Sampling Distribution of \((n-1)S^2/\sigma^2\)

When a simple random sample of size n is selected from a normally distributed
population having population variance \(\sigma^2\), \((n-1)S^2/\sigma^2\) has a Chi-square distribution
with (n − 1) degrees of freedom, where \(S^2\) is the sample variance.


INFERENCES CONCERNING THE POPULATION VARIANCE 


The result given in Table 11.20 will be used to find confidence intervals and test hypotheses 
about a population variance or standard deviation. 


EXAMPLE 11.14 It is important that drug manufacturers control the variation of the dosage in their products. 
A drug company produces 250 milligram tablets of the antibiotic amoxicillin. A sample of size 15 is selected 
from the production process and the level of amoxicillin is determined for each tablet. The results are shown in 
Table 11.21. 


Table 11.21 


The mean of the sample is \(\bar{x} = 250.000\) mg and the sample variance is \(s^2 = .00008538\) mg². According to Table
11.20, \((n-1)S^2/\sigma^2 = 14S^2/\sigma^2\) has a Chi-square distribution with df = 14. Table 11.22 gives the row corresponding to
df = 14 from the Chi-square table in Appendix 4.


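Table 11.22

Area in the right tail under the Chi-square distribution curve

| df | .995 | .990 | .975 | .950 | .900 | .100 | .050 | .025 | .010 | .005 |
| 14 | 4.075 | 4.660 | 5.629 | 6.571 | 7.790 | 21.064 | 23.685 | 26.119 | 29.141 | 31.319 |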
Since \(14S^2/\sigma^2\) has a Chi-square distribution with df = 14, we know that \(14S^2/\sigma^2\) is between 5.629 and 26.119 with
probability .95. This is true because according to Table 11.22, there is 97.5% of the area under the curve to the
right of 5.629 and 2.5% of the area under the curve to the right of 26.119, and therefore 95% of the area under
the curve is between 5.629 and 26.119. This may be expressed as follows:

\[ P\left(5.629 < \frac{14S^2}{\sigma^2} < 26.119\right) = .95 \]

The following notation is often used for the tabled Chi-square values: \(\chi^2_{.975} = 5.629\) and \(\chi^2_{.025} = 26.119\). The
probability statement may be expressed as follows:

\[ P\left(\chi^2_{.975} < \frac{14S^2}{\sigma^2} < \chi^2_{.025}\right) = .95 \]

Now, if the inequality \(\chi^2_{.975} < 14S^2/\sigma^2 < \chi^2_{.025}\) is solved for \(\sigma^2\), we obtain

\[ \frac{14S^2}{\chi^2_{.025}} < \sigma^2 < \frac{14S^2}{\chi^2_{.975}} \]

as a 95% confidence interval for \(\sigma^2\). The numerical confidence interval is obtained by replacing \(S^2\) by .00008538 mg² and using the
values obtained from the Chi-square table. The lower confidence limit is:

\[ \frac{14S^2}{\chi^2_{.025}} = \frac{14 \times .00008538}{26.119} = .00004576 \]

The upper confidence limit is:

\[ \frac{14S^2}{\chi^2_{.975}} = \frac{14 \times .00008538}{5.629} = .0002124 \]

The 95% confidence interval for \(\sigma^2\) goes from .00004576 to .0002124. The 95% confidence interval for \(\sigma\) is
obtained by taking the square root of the limits for \(\sigma^2\). The lower limit is \(\sqrt{.00004576} = .006765\) and the upper
limit is \(\sqrt{.0002124} = .014574\). The 95% confidence interval for the standard deviation (to three decimal places)
is (.007, .015).


The general form for a \((1 - \alpha) \times 100\%\) confidence interval for the population variance is given
by formula (11.6), where the values of \(\chi^2\) are based on the Chi-square distribution with df = n − 1.

\[ \frac{(n-1)S^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}} \qquad (11.6) \]

The \((1 - \alpha) \times 100\%\) confidence interval for the population standard deviation is given by formula
(11.7).

\[ \sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}}} < \sigma < \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}} \qquad (11.7) \]

Both of the confidence intervals assume that the sample was obtained from a normally distributed
population.
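
Formulas (11.6) and (11.7) translate directly into a short computation. The sketch below is illustrative only: the function name and parameter names are hypothetical, and the two Chi-square values are taken from the table rather than computed, so the caller must look them up for the appropriate df. The inputs are those of Example 11.14.

```python
import math

def variance_confidence_interval(n, s2, chi2_alpha_half, chi2_one_minus_alpha_half):
    """Confidence limits for sigma^2 (formula 11.6) and sigma (formula 11.7).

    chi2_alpha_half is the tabled value with right-tail area alpha/2 (the larger
    value); chi2_one_minus_alpha_half has right-tail area 1 - alpha/2.
    """
    lower_var = (n - 1) * s2 / chi2_alpha_half
    upper_var = (n - 1) * s2 / chi2_one_minus_alpha_half
    return (lower_var, upper_var), (math.sqrt(lower_var), math.sqrt(upper_var))

# Example 11.14: n = 15, s^2 = .00008538, tabled values for df = 14
(var_lo, var_hi), (sd_lo, sd_hi) = variance_confidence_interval(
    n=15, s2=0.00008538,
    chi2_alpha_half=26.119, chi2_one_minus_alpha_half=5.629)

print(f"95% CI for the variance:           ({var_lo:.8f}, {var_hi:.7f})")
print(f"95% CI for the standard deviation: ({sd_lo:.3f}, {sd_hi:.3f})")
```

Dividing by the larger tabled value gives the lower limit, which is why the subscripts appear to be reversed in formula (11.6).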
EXAMPLE 11.15 A random sample of the lengths of bolts produced by Fastners, Inc. was taken. The sample
results were as follows: n = 30, \(\bar{x} = 5.000\) cm, and s = 0.055 cm. A 99% confidence interval for \(\sigma\) is determined
as follows: For a 99% confidence interval, \(1 - \alpha = .99\), and therefore \(\alpha = .01\), \(\alpha/2 = .005\), and \(1 - \alpha/2 = .995\).
The degrees of freedom is df = n − 1 = 30 − 1 = 29. Table 11.23 gives that portion of the Chi-square table
needed to determine the values for use in formula (11.7). The following values are shown in bold print in the
table: \(\chi^2_{.995} = 13.121\) and \(\chi^2_{.005} = 52.336\). The sample variance is \(s^2 = (0.055)^2 = .003025\). Substituting into
formula (11.7) we obtain the following 99% confidence interval for the population standard deviation.

\[ \sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}}} < \sigma < \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}} \quad \text{or} \quad \sqrt{\frac{29 \times .003025}{52.336}} < \sigma < \sqrt{\frac{29 \times .003025}{13.121}} \quad \text{or} \quad .041 < \sigma < .082 \]

The population standard deviation of bolt lengths is between .041 cm and .082 cm with 99% confidence.


Table 11.23 
Area in the right tail under the Chi-square distribution curve 


The sampling distribution of \((n-1)S^2/\sigma^2\), given in Table 11.20, is also utilized to test hypotheses
concerning the population variance or standard deviation. The steps for testing a population variance
are given in Table 11.24.


Table 11.24 


Steps for Testing a Hypothesis Concerning \(\sigma^2\): Sampling from a Normal Population


Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by \(H_0\): \(\sigma^2 = \sigma_0^2\)
and the research hypothesis is of the form \(H_a\): \(\sigma^2 \ne \sigma_0^2\) or \(H_a\): \(\sigma^2 < \sigma_0^2\) or \(H_a\): \(\sigma^2 > \sigma_0^2\).


Step 2: For level of significance \(\alpha\), the critical values for \(H_a\): \(\sigma^2 \ne \sigma_0^2\) are \(\chi^2_{1-\alpha/2}\) and \(\chi^2_{\alpha/2}\), the critical value
for \(H_a\): \(\sigma^2 < \sigma_0^2\) is \(\chi^2_{1-\alpha}\), and the critical value for \(H_a\): \(\sigma^2 > \sigma_0^2\) is \(\chi^2_{\alpha}\).


Step 3: Compute the value of the test statistic as follows: \(\chi^{2*} = (n-1)S^2/\sigma_0^2\), where \(\sigma_0^2\) is given in the null
hypothesis and \(S^2\) is computed from your sample.


Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in 
the rejection region. Otherwise, the null hypothesis is not rejected. 


EXAMPLE 11.16 The ratio of potassium to sodium in an individual's diet is sometimes referred to as the K
factor. A sample of 15 Yanomamo Indians from Brazil was obtained and their K factors were determined. The
standard deviation of the sample of K factor values was found to equal 0.15. These sample results were used to
test the hypothesis \(H_0\): \(\sigma = 0.25\) vs. \(H_a\): \(\sigma \ne 0.25\) at level of significance \(\alpha = .05\). The critical values are shown in
bold in Table 11.25. From this table, we have \(\chi^2_{.975} = 5.629\) and \(\chi^2_{.025} = 26.119\). The computed value of the test
statistic is:

\[ \chi^{2*} = \frac{(n-1)s^2}{\sigma_0^2} = \frac{14 \times .15^2}{.25^2} = \frac{14 \times .0225}{.0625} = 5.04 \]

Since the computed value of the test statistic is less than 5.629, the null hypothesis is rejected and it is concluded
that the standard deviation for Yanomamo Indians is less than 0.25.
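
The four steps of Table 11.24, as applied in Example 11.16, can be sketched in Python. The function names are hypothetical, and the critical values are supplied from the Chi-square table rather than computed.

```python
def variance_test_statistic(n, s2, sigma0_sq):
    """Step 3: computed value of (n - 1) * S^2 / sigma_0^2."""
    return (n - 1) * s2 / sigma0_sq

def two_tailed_decision(statistic, lower_critical, upper_critical):
    """Step 4 for a two-tailed research hypothesis."""
    if statistic < lower_critical or statistic > upper_critical:
        return "reject"
    return "do not reject"

# Example 11.16: n = 15, s = 0.15, hypothesized sigma = 0.25, alpha = .05
statistic = variance_test_statistic(n=15, s2=0.15 ** 2, sigma0_sq=0.25 ** 2)
decision = two_tailed_decision(statistic, lower_critical=5.629, upper_critical=26.119)

print(f"test statistic = {statistic:.2f}, decision: {decision} the null hypothesis")
```

For a one-tailed research hypothesis only one of the two comparisons would be made, using the critical value given in Step 2.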
Table 11.25


Solved Problems 


CHI-SQUARE DISTRIBUTION 


11.1 What happens to the peak of the Chi-square curve as the degrees of freedom is increased? 
Ans. The peak of the Chi-square curve shifts to the right as the degrees of freedom is increased. 


11.2 The area under the Chi-square curve corresponding to \(\chi^2\) values greater than 118.498 is .10.
Find \(P(0 < \chi^2 < 118.498)\).


Ans. Since the total area under the Chi-square curve is equal to 1, \(P(0 < \chi^2 < 118.498)\) is equal to
\(1 - P(\chi^2 > 118.498) = 1 - .10 = .90\).


CHI-SQUARE TABLES 


11.3 Find the value of \(\chi^2\) for 5 degrees of freedom and
(a) .005 area in the right tail of the Chi-square distribution curve.
(b) .005 area in the left tail of the Chi-square distribution curve.


Ans. Table 11.26 gives the row corresponding to df = 5 from the Chi-square table. (a) The table
indicates that the area to the right of 16.750 is .005. (b) The table indicates that the area to the
right of 0.412 is .995. Therefore, the area to the left of 0.412 is 1 − .995 = .005.


Table 11.26 


Area in the right tail under the Chi-square distribution curve 


11.4 Table 11.27 gives the row corresponding to df = 10 from the Chi-square table. Find the area
under the Chi-square distribution curve having df = 10 corresponding to \(\chi^2\) between 4.865 and
23.209.

Table 11.27


Ans. According to Table 11.27, the area to the right of 23.209 is .010 and the area to the right of 4.865
is .900. In Fig. 11-8, the shaded area equals .010 and corresponds to \(\chi^2\) values exceeding 23.209.
The shaded area in Fig. 11-9 equals .900 and corresponds to \(\chi^2\) exceeding 4.865. The area
corresponding to \(\chi^2\) between 4.865 and 23.209 is the difference in the two shaded areas, or .900 −
.010 = .890.
[Figures: Chi-square distribution curves for df = 10. In Fig. 11-8 the shaded area to the right of 23.209 equals .010; in Fig. 11-9 the shaded area to the right of 4.865 equals .900.]


Fig. 11-8 Fig. 11-9


GOODNESS-OF-FIT TEST 


11.5 A magazine reported that 75% of the population oppose same sex marriages, 20% approve, and
5% are undecided. A survey is conducted to test the research hypothesis that the distribution is
different from that reported by the magazine. State the null and alternative hypotheses in terms
of \(p_1\), \(p_2\), and \(p_3\), where \(p_1\) = the population proportion opposed to same sex marriage, \(p_2\) = the
population proportion who approve of same sex marriage, and \(p_3\) = the proportion who are
undecided.


Ans. The null hypothesis is \(H_0\): The population proportions are \(p_1 = .75\), \(p_2 = .20\), and \(p_3 = .05\). The
research hypothesis is \(H_a\): The population proportions are not \(p_1 = .75\), \(p_2 = .20\), and \(p_3 = .05\).


11.6 The fairness of a die may be tested as a goodness-of-fit test. Let \(p_i\) represent the proportion of
the time that face i turns up when the die is tossed, where i is 1, 2, 3, 4, 5, or 6. If the null
hypothesis is that the die is fair and the research hypothesis is that the die is unfair, state \(H_0\)
and \(H_a\) in terms of the \(p_i\).


Ans. The null hypothesis is \(H_0\): \(p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}\). The research hypothesis is \(H_a\): Not all \(p_i\)
equal \(\frac{1}{6}\).


OBSERVED AND EXPECTED FREQUENCIES 


11.7 Refer to problem 11.5. A poll was conducted and 585 opposed same sex marriage, 195
approved, and 120 were undecided. What results would you expect if the null hypothesis were
true?


Ans. The sample size is n = 900. The expected frequencies are: \(e_1 = np_1 = 900 \times .75 = 675\), \(e_2 = np_2 =
900 \times .20 = 180\), and \(e_3 = np_3 = 900 \times .05 = 45\).


11.8 Refer to problem 11.6. The die in question was rolled and face 1 turned up 89 times, face 2
turned up 93 times, face 3 turned up 103 times, face 4 turned up 111 times, face 5 turned up
100 times, and face 6 turned up 104 times. What results would you expect if the null
hypothesis is true and the die is fair?


Ans. The sample size is n = 600. If the null hypothesis is true, then each face should occur 100 times in
600 rolls.
SAMPLING DISTRIBUTION OF THE GOODNESS-OF-FIT TEST STATISTIC


11.9 Refer to problems 11.5 and 11.7. Test the hypothesis at \(\alpha = .01\).


Ans. The number of categories is k = 3, and the degrees of freedom is df = k − 1 = 2. By referring to the
Chi-square distribution table in Appendix 4, we find that the critical value for a 1% level of
significance is equal to 9.210. The null hypothesis will be rejected if the computed value of the test
statistic exceeds this critical value.


The computation of the test statistic is shown in Table 11.28. Since the computed test statistic is
138.250, the null hypothesis is rejected. There is a much higher number in the undecided group
than would be expected if the null hypothesis were true.


Table 11.28

| Category | o | e | (o − e)²/e |
| Oppose | 585 | 675 | 12.000 |
| Support | 195 | 180 | 1.250 |
| Undecided | 120 | 45 | 125.000 |
| Sum | 900 | 900 | 138.250 |

11.10 Refer to problems 11.6 and 11.8. Test the hypothesis at \(\alpha = .05\).


Ans. The number of categories is k = 6, and the degrees of freedom is df = k − 1 = 5. Table 11.29 gives
the row corresponding to df = 5 from the Chi-square distribution table in Appendix 4. The critical
value is 11.070, as shown in bold print.


Table 11.29
Area in the right tail under the Chi-square distribution curve

| df | .995 | .990 | .975 | .950 | .900 | .100 | .050 | .025 | .010 | .005 |
| 5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 9.236 | 11.070 | 12.833 | 15.086 | 16.750 |


The computation of the test statistic is shown in Table 11.30. The computed value of the test 
statistic is 3.16. Based on this experiment, there is no reason to doubt the fairness of the die. 


Table 11.30 
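
The arithmetic summarized for problem 11.10 can be reproduced with a few lines of Python. This is an illustrative sketch; the variable names are hypothetical, the counts come from problem 11.8, and the critical value 11.070 is read from Table 11.29.

```python
observed = [89, 93, 103, 111, 100, 104]   # counts for faces 1 through 6
n = sum(observed)
expected = [n / 6] * 6                     # a fair die: each face expected n/6 times

# Contribution of each face to the test statistic, (o - e)^2 / e
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)

critical_value = 11.070                    # df = 5, alpha = .05, from Table 11.29
reject = statistic > critical_value

print("contributions:", [round(c, 2) for c in contributions])
print(f"statistic = {statistic:.2f}, reject H0: {reject}")
```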


CHI-SQUARE INDEPENDENCE TEST 


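11.11 A study of attitudes toward crime was conducted in several cities from across the country. The
cities were divided into the categories South, Northeast, North Central, and West. Based on
interviews, crime was classified as a major concern, a minor concern, or of no concern for
each individual interviewed. The results are shown in Table 11.31. If the level of concern
regarding crime is independent of the section of the country, find the expected frequencies for
the 12 cells.


Table 11.31

| | South | Northeast | North Central | West | Row total |
| Major concern | 25 | 40 | 35 | 40 | 140 |
| Minor concern | 65 | 45 | 40 | 40 | 190 |
| No concern | 15 | 10 | 15 | 20 | 60 |
| Column total | 105 | 95 | 90 | 100 | 390 |


Ans. The computations of the expected frequencies are shown in Table 11.32.


Table 11.32

| | South | Northeast | North Central | West | Row total |
| Major concern | 140 × 105/390 = 37.69 | 140 × 95/390 = 34.10 | 140 × 90/390 = 32.31 | 140 × 100/390 = 35.90 | 140 |
| Minor concern | 190 × 105/390 = 51.15 | 190 × 95/390 = 46.28 | 190 × 90/390 = 43.85 | 190 × 100/390 = 48.72 | 190 |
| No concern | 60 × 105/390 = 16.15 | 60 × 95/390 = 14.62 | 60 × 90/390 = 13.85 | 60 × 100/390 = 15.38 | 60 |
| Column total | 105 | 95 | 90 | 100 | 390 |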


11.12 Table 11.33 gives the results of a study involving marital status and net worth in thousands. 
Find the expected frequencies assuming the two characteristics are independent. Comment on 
the expected frequencies. 


Table 11.33 


|__Net worth | Married __|__Single__|_Widowed_| 
100 to 249 


25010499 | 60 | 20 tS 
500 to 999 
| 1000ormore [10 T 


Ans. The computations of the expected frequencies are shown in Table 11.34. 


Table 11.34 


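| Net worth | Married | Single | Widowed | Row total |
| 100 to 249 | 340 × 310/482 = 218.67 | 340 × 76/482 = 53.61 | 340 × 96/482 = 67.72 | 340 |
| 250 to 499 | 105 × 310/482 = 67.53 | 105 × 76/482 = 16.56 | 105 × 96/482 = 20.91 | 105 |
| 500 to 999 | 22 × 310/482 = 14.15 | 22 × 76/482 = 3.47 | 22 × 96/482 = 4.38 | 22 |
| 1000 or more | 15 × 310/482 = 9.65 | 15 × 76/482 = 2.37 | 15 × 96/482 = 2.99 | 15 |
| Column total | 310 | 76 | 96 | 482 |


Note that 4 of the 12, or 33%, of the expected frequencies are less than 5. The Chi-square test of
independence is not appropriate.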
SAMPLING DISTRIBUTION OF THE TEST STATISTIC FOR THE
CHI-SQUARE INDEPENDENCE TEST


11.13 Use Minitab to test the hypothesis that opinion regarding crime is independent of the section
of the country in problem 11.11 at level of significance \(\alpha = .05\). Compare the expected
frequencies in the Minitab output with those computed in problem 11.11.


Row   C1   C2   C3   C4
  1   25   40   35   40
  2   65   45   40   40
  3   15   10   15   20


MTB > chisquare c1-c4
Expected counts are printed below observed counts.

          C1      C2      C3      C4   Total
    1     25      40      35      40     140
       37.69   34.10   32.31   35.90
    2     65      45      40      40     190
       51.15   46.28   43.85   48.72
    3     15      10      15      20      60
       16.15   14.62   13.85   15.38
Total    105      95      90     100     390


Chi-Sq = 4.274 + 1.020 + 0.224 + 0.469 + 3.748 + 0.036 + 0.337 + 1.560 + 0.082 + 1.457 + 0.096 +
1.385 = 14.688
DF = 6, P value = 0.023


Ans. Since the computed p value = 0.023 is less than \(\alpha = .05\), the null hypothesis is rejected. The
expected frequencies are the same as computed in problem 11.11.
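
The Minitab result for problem 11.13 can be cross-checked with a short Python sketch. The function names are hypothetical. The p value uses a closed form for the upper tail of the Chi-square distribution that is valid only for an even number of degrees of freedom, which happens to hold here (df = 6).

```python
import math

def independence_statistic(observed):
    """Return (statistic, df) for the Chi-square test of independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = sum(
        (o - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(observed, row_totals)
        for o, c in zip(row, col_totals)
    )
    return statistic, (len(row_totals) - 1) * (len(col_totals) - 1)

def upper_tail_even_df(x, df):
    """P(chi-square > x); closed form that requires an even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Observed frequencies from problem 11.11 (columns: South, Northeast, North Central, West)
crime_table = [[25, 40, 35, 40],
               [65, 45, 40, 40],
               [15, 10, 15, 20]]

chi_sq, dof = independence_statistic(crime_table)
p_val = upper_tail_even_df(chi_sq, dof)
print(f"Chi-Sq = {chi_sq:.3f}, DF = {dof}, P value = {p_val:.3f}")
```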


11.14 Refer to problem 11.12. Combine the categories Single and Widowed into a new category
called Nonmarried so that all cell expected frequencies exceed 5 and perform a test of
independence.


Ans. Table 11.35 gives the new results after combining categories.


Table 11.35
| Net worth | Married | Nonmarried |
| 100 to 249 | 225 | 115 |
| 250 to 499 | 60 | 45 |
| 500 to 999 | 15 | 7 |
| 1000 or more | 10 | 5 |


The Minitab analysis of the test of independence is as follows: 


Data Display 

Row Cl C2 
1 225 115 
2 60 45 
3 15 7 
4 10 5 
MTB > Chisquare c1 c2


Chi-Square Test
Expected counts are printed below observed counts


C1 C2 Total

1 225 115 340
218.67 121.33 

2 60 45 105 
67.53 37.47 

3 15 7 22 
14.15 7.85 

4 10 5 15 
9.65 5.35 


Total 310 172 482 


Chi-Sq = 0.183 + 0.330 + 0.840 + 1.514 + 0.051 + 0.092 + 0.013 + 0.023 = 3.046 
DF = 3, P value = 0.385 


SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE 


11.15 A small population consists of the three values 10, 20, and 30. Determine the population
variance and the sampling distribution of \(s^2\) for samples of size 2.


Ans. The population mean is 20. The population variance is:

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{(10-20)^2 + (20-20)^2 + (30-20)^2}{3} = 66.67 \]

The three samples and their variances are: 10 and 20, \(s^2 = 50\); 10 and 30, \(s^2 = 200\); 20 and 30, \(s^2 =
50\). The sampling distribution for \(s^2\) is: \(P(s^2 = 50) = \frac{2}{3}\), \(P(s^2 = 200) = \frac{1}{3}\).


11.16 A sample of size 15 is taken from a normal population having \(\sigma^2 = 14\). What is the probability
that \(s^2\) exceeds 21.064?


Ans. The statistic \(\frac{(n-1)S^2}{\sigma^2} = \frac{14S^2}{14} = S^2\) has a Chi-square distribution with 14 degrees of freedom.
From the Chi-square table, the area to the right of 21.064 is 0.10 when df = 14. Therefore \(P(s^2 >
21.064) = 0.10\).


INFERENCES CONCERNING THE POPULATION VARIANCE 


11.17 The standard deviation for the annual medical costs of 30 heart disease patients was found to
equal $1005. Find a 95% confidence interval for the standard deviation of the annual medical
costs of all heart disease patients.


Ans. The general form of the confidence interval for the population variance is

\[ \frac{(n-1)S^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}} \]

The level of confidence is \(1 - \alpha = .95\). This implies that \(\alpha = .05\), \(\alpha/2 = .025\), and \(1 - \alpha/2 = .975\).
The Chi-square table values are \(\chi^2_{.025} = 45.722\) and \(\chi^2_{.975} = 16.047\), for df = 29.
The lower limit for the 95% confidence interval for \(\sigma^2\) is

\[ \frac{(n-1)S^2}{\chi^2_{.025}} = \frac{29 \times 1{,}010{,}025}{45.722} = 640{,}626.5037 \]

The upper limit for the 95% confidence interval for \(\sigma^2\) is

\[ \frac{(n-1)S^2}{\chi^2_{.975}} = \frac{29 \times 1{,}010{,}025}{16.047} = 1{,}825{,}308.469 \]

The lower limit for the 95% confidence interval for \(\sigma\) is \(\sqrt{640{,}626.5037}\) = $800.39.
The upper limit for the 95% confidence interval for \(\sigma\) is \(\sqrt{1{,}825{,}308.469}\) = $1,351.04.


11.18 In manufacturing processes, it is desirable to control the variability of mass produced items. A
machine fills containers with 5.68 liters of bleach. The variability in fills is acceptable
provided the standard deviation is 0.15 liters or less. A sample of 10 containers is chosen to
test the research hypothesis that the standard deviation exceeds 0.15 liters at level of
significance \(\alpha = .05\). What conclusion is made if the variance of the sample is found to equal
0.095 squared liters?


Ans. The null hypothesis is \(H_0\): \(\sigma = 0.15\) and the research hypothesis is \(H_a\): \(\sigma > 0.15\).
The degrees of freedom is df = 10 − 1 = 9, and the critical value is \(\chi^2_{.05} = 16.919\).
The computed value of the test statistic is

\[ \chi^{2*} = \frac{(n-1)s^2}{\sigma_0^2} = \frac{9 \times .095}{.0225} = 38.0 \]

Since 38.0 exceeds 16.919, the null hypothesis is rejected.
Based on this sample, it is concluded that the variability is unacceptable and the filling machine
needs to be checked.


Supplementary Problems 


CHI-SQUARE DISTRIBUTION 


11.19 Which one of the following terms best describes the Chi-square distribution curve: right-skewed,
left-skewed, symmetric, or uniform?


Ans. right-skewed


11.20 According to Chebyshev's theorem, at least 75% of any distribution will fall within 2 standard
deviations of the mean. For a Chi-square distribution having 8 degrees of freedom, find two values a and
b such that at least 75% of the area under the distribution curve will be between those values.


Ans. The mean is 8 and the standard deviation is 4, so a = 8 − 2 × 4 = 0 and b = 8 + 2 × 4 = 16.


CHI-SQUARE TABLES 


11.21 A Chi-square distribution has 11 degrees of freedom. Find the \(\chi^2\) values corresponding to the following
right-hand tail areas.
(a) .005 (b) .975 (c) .990 (d) .025


Ans. (a) 26.757 (b) 3.816 (c) 3.053 (d) 21.920
11.22 A Chi-square distribution has 17 degrees of freedom. Find the left-hand tail area corresponding to the
following values.
(a) 30.191 (b) 6.408 (c) 33.409 (d) 24.769


Ans. (a) .975 (b) .010 (c) .990 (d) .900


GOODNESS-OF-FIT TEST 


11.23 A national survey found that among married Americans age 40 to 65 with household income above
$50,000, 45% planned to work after retirement age, 45% planned not to work after retirement age, and
10% were not sure. A similar survey was conducted in Nebraska. Let \(p_1\) represent the percent in
Nebraska who plan to work after retirement age, \(p_2\) represent the percent in Nebraska that plan not to
work after retirement age, and \(p_3\) represent the percent not sure. What null and research hypotheses
would be tested to compare the multinomial probability distribution in Nebraska with the national
distribution?


Ans. \(H_0\): \(p_1 = .45\), \(p_2 = .45\), and \(p_3 = .10\); \(H_a\): The proportions are not as stated in the null hypothesis.


11.24 A sociological study concerning impaired coworkers was conducted. It was reported that 30% of
American workers have worked with someone whose alcohol or drug use affected his/her job, 60% have
not worked with such individuals, and 10% did not know. The Central Intelligence Agency (CIA)
conducted a similar study of CIA employees. Let \(p_1\) represent the proportion of CIA employees who
have worked with someone whose alcohol or drug use affected his/her job, \(p_2\) represent the proportion
of CIA employees who have not worked with such coworkers, and \(p_3\) represent the proportion who do
not know. What null and research hypotheses would you test to determine if the CIA multinomial
distribution was the same as that reported in the study?


Ans. \(H_0\): \(p_1 = .30\), \(p_2 = .60\), and \(p_3 = .10\); \(H_a\): The proportions are not as stated in the null hypothesis.


OBSERVED AND EXPECTED FREQUENCIES 


11.25 Refer to problem 11.23. The Nebraska survey found that 120 planned to work after retirement age, 160
did not plan to work after retirement age, and 35 were not sure. What are the expected frequencies?


Ans. \(e_1 = 141.75\), \(e_2 = 141.75\), and \(e_3 = 31.5\)


11.26 Refer to problem 11.24. The CIA study found that 37 had worked with someone whose alcohol or drug
use had affected their job, 58 had not, and 17 did not know. What are the expected frequencies?


Ans. \(e_1 = 33.6\), \(e_2 = 67.2\), and \(e_3 = 11.2\)


SAMPLING DISTRIBUTION OF THE GOODNESS-OF-FIT TEST STATISTIC 


11.27 Refer to problems 11.23 and 11.25. Perform the test at \(\alpha = .01\).


Ans. The critical value is 9.21. The computed value of the test statistic is \(\chi^{2*} = 6.076\). It cannot be
concluded that the Nebraska distribution is different from the national distribution.


11.28 Refer to problems 11.24 and 11.26. Perform the test at \(\alpha = .05\).


Ans. The critical value is 5.991. The computed value of the test statistic is \(\chi^{2*} = 4.607\). It cannot be
concluded that the CIA distribution is different from that reported in the sociological study.
CHI-SQUARE INDEPENDENCE TEST


11.29 Five hundred arrest records were randomly selected and the records were categorized according to age 
and type of violent crime. The results are shown in Table 11.36. 


Table 11.36 
pow Murders) 5 
ae eae eT ae eee aes | es ee: ee 
2 


Find the table of expected frequencies assuming independence, and comment on performing the Chi- 


square test of independence. 


Ans. The expected frequencies are given in Table 11.37. The Chi-square test of independence is
inappropriate because 37.5% of the expected frequencies are less than 5 and one is less than 1.


Table 11.37 
| Murder | 19.14 | 0.75 | a7 | 
| Rape | 36.48 | 2050 8122 
| Robbery | 125.58 | 70.56 | 9.66 | 420 
| Assault | 7.816619 9.06 38.4 


11.30 Table 11.38 gives various cancer sites and smoking history for participants in a cancer study. 


Table 11.38 
po Smoking history 
15 En aes fe OO 
| Othercancer | 250 | 135, | 90 |S 


Find the expected frequencies, assuming that cancer site is independent of smoking history. 


Ans. The expected frequencies are shown in Table 11.39. 


Table_11.39 
| Light | Medium | Heavy 
32.82 


136.75 130.26 122.01 
958.18 912.75 854.93 


28.29 26.95 


SAMPLING DISTRIBUTION OF THE TEST STATISTIC FOR THE CHI-SQUARE 
INDEPENDENCE TEST 


11.31 Combine the categories “34-40” and “Over 40” into a new category called “Over 33” and perform the 
Chi-square independence test for problem 11.29. 
Ans. The new contingency table is given in Table 11.40. The expected frequencies are shown in
parentheses.


Table 11.40 


| Murder}, 1 (19.14) 15 10.75) [6 (2.11) 
| Rape | 2 (36.48) | 26 (20.50) | 14 (4.03) 


Two of the cells have expected frequencies less than 5. Some statisticians would not perform the
independence test because two of the expected frequencies are less than 5. The less restrictive rule
states that the test is valid if all expected frequencies are at least one and at most 20% of the
expected frequencies are less than 5. The computed value of the test statistic is 71.918 and the p
value is .000. The type of crime is most likely dependent on the age group.


11.32 Compute the value of the test statistic for the test of independence in problem 11.30 and test the
hypothesis of independence at $\alpha = .01$.


Ans. The degrees of freedom is df = (r - 1) × (c - 1) = (4 - 1) × (4 - 1) = 9. The critical value is
21.666. The computed value of the test statistic is $\chi^{2*} = 327.960$. The cancer site is dependent
upon the smoking history of an individual.


SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE 


11.33 Fill in the following parentheses with s or $\sigma$.
(a) ( ) is a constant, but ( ) is a variable.
(b) ( ) has a distribution, but ( ) does not have a distribution.
(c) ( ) is a parameter and ( ) is a statistic.
(d) ( ) describes the variability of some characteristic for a sample, whereas ( ) describes the
variability of a characteristic for a population.


Ans. (a) $\sigma$, s (b) s, $\sigma$ (c) $\sigma$, s (d) s, $\sigma$


11.34 What distributional assumption is necessary in order that $\frac{(n-1)S^2}{\sigma^2}$ have a Chi-square distribution with
(n - 1) degrees of freedom?


Ans. The random sample from which $S^2$ is computed is assumed to be selected from a normal
distribution.


INFERENCES CONCERNING THE POPULATION VARIANCE 


11.35 Table 11.41 gives the monthly rent for 30 randomly selected 2-bedroom apartments in good condition 
from the state of New York. Find a 95% confidence interval for the standard deviation of all monthly 
rents in New York for 2-bedroom apartments in good condition. 


Table 11.41 
Ans. The sample standard deviation is equal to $203.24.
The Chi-square table values are $\chi^2_{.025} = 45.722$ and $\chi^2_{.975} = 16.047$ for 29 degrees of freedom.
The 95% confidence interval extends from $161.86 to $273.22.
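
The interval in problem 11.35 follows from the limits $\sqrt{(n-1)s^2/\chi^2_{.025}}$ and $\sqrt{(n-1)s^2/\chi^2_{.975}}$. A minimal Python sketch of that arithmetic is shown below. The function and parameter names are illustrative, and the Chi-square table values are taken from the answer rather than computed.

```python
import math

def sd_confidence_interval(s, n, chi2_upper, chi2_lower):
    """Confidence limits for sigma from the sample standard deviation s.

    chi2_upper and chi2_lower are the table values that cut off the
    upper and lower tail areas for n - 1 degrees of freedom.
    """
    ss = (n - 1) * s ** 2
    return math.sqrt(ss / chi2_upper), math.sqrt(ss / chi2_lower)

# Values from the answer to problem 11.35 (29 degrees of freedom).
low, high = sd_confidence_interval(s=203.24, n=30, chi2_upper=45.722, chi2_lower=16.047)
print(f"95% confidence interval for sigma: {low:.2f} to {high:.2f}")
```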


11.36 Table 11.42 gives the ages of 30 randomly selected airline pilots. Use the data to test the null hypothesis
that the standard deviation of the ages of airline pilots is equal to 10 years vs. the alternative that the
standard deviation is not equal to 10 years. Use level of significance $\alpha = .01$.


Ans. The standard deviation of the sample is 8.83 years. The critical values are 13.121 and 52.336. The 
computed value of the test statistic is 22.61. The sample data does not contradict the null 
hypothesis. 
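
The test statistic in problem 11.36 is $(n-1)s^2/\sigma_0^2$, where $\sigma_0 = 10$ is the hypothesized standard deviation. The Python sketch below reproduces the computation and the two-tailed decision. The names are illustrative, and the critical values are the table values quoted in the answer.

```python
def variance_test_statistic(s, n, sigma0):
    """Chi-square statistic for testing H0: sigma = sigma0."""
    return (n - 1) * s ** 2 / sigma0 ** 2

chi2_stat = variance_test_statistic(s=8.83, n=30, sigma0=10)

# Two-tailed critical values for 29 degrees of freedom at the .01 level, from the answer.
lower_crit, upper_crit = 13.121, 52.336
reject_null = chi2_stat < lower_crit or chi2_stat > upper_crit

print(f"test statistic = {chi2_stat:.2f}, reject H0: {reject_null}")
```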


Table 11.42 


Chapter 12 


Analysis of Variance (ANOVA) 


F DISTRIBUTION 


Analysis of Variance (ANOVA) procedures are used to compare the means of several 
populations. ANOVA procedures might be used to answer the following questions: Do differences 
exist in the mean levels of repetitive stress injuries (rsi) for four different keyboard designs? Is there
a difference in the mean amount awarded for age bias suits, race bias suits, and sex bias suits? Is 
there a difference in the mean amount spent on vacations for the three minority groups: Asians, 
Hispanics, and African Americans? 

ANOVA procedures utilize a distribution called the F distribution. A given F distribution has
two separate degrees of freedom, represented by $df_1$ and $df_2$. The first, $df_1$, is called the degrees of
freedom for the numerator and the second, $df_2$, is called the degrees of freedom for the denominator.
For an F distribution with 5 degrees of freedom for the numerator and 10 degrees of freedom for the
denominator, we write $df = (df_1, df_2) = (5, 10)$. Figure 12-1 shows one F distribution with df = (10, 50)
and another with df = (10, 5).


Fig. 12-1


Table 12.1 gives the basic properties of the F distribution. 


Table 12.1 


1. The total area under the F distribution curve is equal to one.

2. An F distribution curve starts at 0 on the horizontal axis and extends
indefinitely to the right, approaching but never touching the horizontal axis.

3. An F distribution curve is always skewed to the right.

4. The mean of the F distribution is $\mu = \frac{df_2}{df_2 - 2}$ for $df_2 > 2$, where $df_2$ is the
degrees of freedom for the denominator.
EXAMPLE 12.1 The F distribution with df = (10, 5) shown in Fig. 12-1 has mean equal to
$\mu = \frac{df_2}{df_2 - 2} = \frac{5}{5 - 2} = 1.67$, and the F distribution shown in the same figure with df = (10, 50) has mean
$\mu = \frac{df_2}{df_2 - 2} = \frac{50}{50 - 2} = 1.04$.
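
The mean depends only on the denominator degrees of freedom. A short Python sketch of the formula, with an illustrative function name, is:

```python
def f_mean(df2):
    """Mean of an F distribution with df2 denominator degrees of freedom."""
    if df2 <= 2:
        raise ValueError("the mean of the F distribution is defined only for df2 > 2")
    return df2 / (df2 - 2)

# The two distributions of Example 12.1.
for df in [(10, 5), (10, 50)]:
    print(f"df = {df}: mean = {f_mean(df[1]):.2f}")
```

As the denominator degrees of freedom grow, the mean approaches 1.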

F TABLE 


The F distribution table for 5% and 1% right-hand tail areas is given in Appendix 5. A technique 
for finding F values for the 5% and 1% left-hand tail areas will be given after illustrating how to read 
the table. 


EXAMPLE 12.2 Consider the F distribution with $df_1 = 4$ and $df_2 = 6$. Table 12.2 shows a portion of the F
distribution table, with right tail area equal to .05, given in Appendix 5.


Table 12.2 


a. 
ne? 


1 
2 
3 
4 
5 
6 


Table 12.2 indicates that the area to the right of 4.53 is .05. This is shown in Fig. 12-2. The F distribution
table with right tail area equal to .01 indicates that the area to the right of 9.15 is .01. This is shown in Fig.
12-3.


Fig. 12-2 (area = .05 to the right of F = 4.53)    Fig. 12-3 (area = .01 to the right of F = 9.15)


The following notation will be used with respect to the F distribution. The value 4.53, shown in Fig. 12-2, is
represented as $F_{.05}(4, 6) = 4.53$, and the value 9.15, shown in Fig. 12-3, is represented as $F_{.01}(4, 6) = 9.15$.
Using the notation introduced in Example 12.2, the result shown in formula (12.1) holds for any
F distribution.


$$F_{1-\alpha}(df_1, df_2) = \frac{1}{F_{\alpha}(df_2, df_1)} \tag{12.1}$$


EXAMPLE 12.3 Formula (12.1) may be used to find the F values for 1% and 5% left-hand tail areas for the F
distribution discussed in Example 12.2. The symbol $F_{.95}(4, 6)$ represents the value for which the area to the right
of $F_{.95}(4, 6)$ is .95 and therefore the area to the left of $F_{.95}(4, 6)$ is .05. Using formula (12.1), we find that


$$F_{.95}(4, 6) = \frac{1}{F_{.05}(6, 4)} = \frac{1}{6.16} = 0.1623$$


The value $F_{.05}(6, 4) = 6.16$ is shown in bold print in Table 12.3, which is taken from the F distribution table.


Table 12.3 


Similarly, the F value for the 1% left-hand tail area is


$$F_{.99}(4, 6) = \frac{1}{F_{.01}(6, 4)} = \frac{1}{15.21} = 0.0657$$
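
The reciprocal relationship in formula (12.1) is easy to apply once the right-hand tail value with the degrees of freedom interchanged has been read from the table. The Python sketch below uses the two table values quoted in Example 12.3; the function name is illustrative.

```python
def left_tail_f(right_tail_value_swapped_df):
    """Apply formula (12.1): F_{1-alpha}(df1, df2) = 1 / F_alpha(df2, df1).

    The argument is the right-hand tail table value read with the
    numerator and denominator degrees of freedom interchanged.
    """
    return 1 / right_tail_value_swapped_df

f_95_4_6 = left_tail_f(6.16)    # uses F_.05(6, 4) = 6.16
f_99_4_6 = left_tail_f(15.21)   # uses F_.01(6, 4) = 15.21

print(f"F_.95(4, 6) = {f_95_4_6:.4f}")
print(f"F_.99(4, 6) = {f_99_4_6:.4f}")
```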


LOGIC BEHIND A ONE-WAY ANOVA 


An experiment was conducted to compare three different computer keyboard designs with
respect to their effect on repetitive stress injuries (rsi). Fifteen businesses of comparable size
participated in a study to compare the three keyboard designs. Five of the fifteen businesses were
randomly selected and their computers were equipped with design 1 keyboards. Five of the remaining
ten were selected and equipped with design 2 keyboards, and the remaining five used design 3
keyboards. After one year, the number of rsi was recorded for each company. The results are shown
in Table 12.4.


Table 12.4 
Design 1      Design 2      Design 3
   10            24            17
   10            22            17
    8            24            15
   10            24            19
   12            26            17
mean = 10     mean = 24     mean = 17


A Minitab dotplot of the data for the three designs is shown in Fig. 12-4. 
Fig. 12-4


If $\mu_1$, $\mu_2$, and $\mu_3$ represent the mean number of rsi for design 1, design 2, and design 3,
respectively, for the population of all such companies, the data indicate that the population means
differ.

Suppose the data for the study were as shown in Table 12.5. A Minitab dotplot for these data is 
shown in Fig. 12-5. 


Table 12.5 
Design 1      Design 2      Design 3
   10            34            29
   12            14            17
    5            24             5
    1            19            10
   22            29            24
mean = 10     mean = 24     mean = 17
Fig. 12-5


Note that for the data shown in Fig. 12-5, it is not clear that the mean number of rsi differs for the
three populations using the three different designs. The difference between the two sets of data can
be explained by considering two different sources of variation. The between samples variation is
measured by considering the variation between the sample means. In both cases, the sample means
are $\bar{x}_1 = 10$, $\bar{x}_2 = 24$, and $\bar{x}_3 = 17$. The between samples variation is the same for the two sets of data.
The within samples variation is measured by considering the variation within the samples. The
dotplots indicate that the within samples variation is much greater for the data in Table 12.5 than for
the data in Table 12.4. The measurement of these two sources of variation will be discussed in the
next section. However, it is clear from the above discussion that a decision concerning the
effectiveness of the three designs may be based on a consideration of the two sources of variation. In
particular, it will be shown that the ratio of the between variation to the within variation may be used
to decide whether $\mu_1$, $\mu_2$, and $\mu_3$ are equal or not.




EXAMPLE 12.4 Figures 12-6 and 12-7 show boxplots for the data shown in Tables 12.4 and 12.5, 
respectively. Notice that Fig. 12-6 suggests different population means, while Fig. 12-7 suggests that it is 
possible that the population means may be the same. 


Fig. 12-6 (boxplots of rsi by keyboard, Table 12.4)    Fig. 12-7 (boxplots of rsi by keyboard, Table 12.5)


SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM 
FOR A ONE-WAY ANOVA 


The following notation will be used when samples from k different normally distributed 
populations having equal population variances are selected in order to test for the equality of the 
means of the k populations: 


The sample size, sample mean, and sample variance for the ith population are represented by
$n_i$, $\bar{x}_i$, and $s_i^2$, respectively. The total sample size is $n = n_1 + n_2 + \cdots + n_k$. The overall mean
for all n sample values is represented by $\bar{x}$. The population mean for the ith population is
represented by $\mu_i$, and the standard deviation for the ith population is represented by $\sigma_i$.


The between samples variation is measured by the between treatments mean square and is
represented by MSTR. The expression for MSTR is given by formula (12.2):

$$\text{MSTR} = \frac{\text{SSTR}}{k - 1} \tag{12.2}$$

The numerator of formula (12.2), SSTR, is called the treatment sum of squares, and is computed by
using formula (12.3):

$$\text{SSTR} = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_k(\bar{x}_k - \bar{x})^2 \tag{12.3}$$

The within samples variation is measured by the error mean square and is represented by MSE.
The expression for MSE is given by

$$\text{MSE} = \frac{\text{SSE}}{n - k} \tag{12.4}$$

The numerator of formula (12.4), SSE, is called the error sum of squares, and is computed by using
formula (12.5), where $s_1^2, s_2^2, \ldots, s_k^2$ are the sample variances.

$$\text{SSE} = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2 \tag{12.5}$$


The denominator of formula (12.2), k - 1, is called the degrees of freedom for treatments and the
denominator of formula (12.4), n - k, is called the degrees of freedom for error.

The sum of the treatment sum of squares and the error sum of squares is called the total sum of 
squares. The total sum of squares is represented by SST and is given by 


SST = SSTR + SSE (12.6) 


The total sum of squares may be computed directly by using formula (12.7), where the sum is
over all n sample values. The degrees of freedom for total is equal to n - 1.


$$\text{SST} = \sum (x - \bar{x})^2 \tag{12.7}$$


EXAMPLE 12.5 The data from Table 12.4 are listed below for convenient reference.


Design 1      Design 2      Design 3
   10            24            17
   10            22            17
    8            24            15
   10            24            19
   12            26            17
mean = 10     mean = 24     mean = 17


Since the three sample sizes are equal, the overall mean is computed by finding the mean of the three treatment
(design) means. The mean of 10, 24, and 17 is 17, and therefore $\bar{x} = 17$. The total sum of squares will be found
first.


$$\begin{aligned}
\text{SST} &= \sum (x - \bar{x})^2 \\
&= (10-17)^2 + (10-17)^2 + (8-17)^2 + (10-17)^2 + (12-17)^2 + (24-17)^2 + (22-17)^2 + (24-17)^2 \\
&\quad + (24-17)^2 + (26-17)^2 + (17-17)^2 + (17-17)^2 + (15-17)^2 + (19-17)^2 + (17-17)^2 \\
&= 49 + 49 + 81 + 49 + 25 + 49 + 25 + 49 + 49 + 81 + 0 + 0 + 4 + 4 + 0 \\
&= 514
\end{aligned}$$


The treatment sum of squares is computed by using formula (12.3).


$$\begin{aligned}
\text{SSTR} &= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + n_3(\bar{x}_3 - \bar{x})^2 \\
&= 5 \times (10-17)^2 + 5 \times (24-17)^2 + 5 \times (17-17)^2 \\
&= 5 \times 49 + 5 \times 49 + 5 \times 0 \\
&= 490
\end{aligned}$$


We need to find the three sample variances before finding the error sum of squares. The formula for the
sample variance is applied to each sample separately as follows:

$$s_1^2 = \frac{(10-10)^2 + (10-10)^2 + (8-10)^2 + (10-10)^2 + (12-10)^2}{5 - 1} = 2$$

$$s_2^2 = \frac{(24-24)^2 + (22-24)^2 + (24-24)^2 + (24-24)^2 + (26-24)^2}{5 - 1} = 2$$

$$s_3^2 = \frac{(17-17)^2 + (17-17)^2 + (15-17)^2 + (19-17)^2 + (17-17)^2}{5 - 1} = 2$$

The error sum of squares is computed using formula (12.5).


$$\begin{aligned}
\text{SSE} &= (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 \\
&= 4 \times 2 + 4 \times 2 + 4 \times 2 = 24
\end{aligned}$$
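
The defining formulas (12.3), (12.5), and (12.7) translate directly into a few lines of code. The following Python sketch, which uses only the standard library and illustrative variable names, recomputes the three sums of squares for the keyboard data:

```python
from statistics import mean, variance

# Number of rsi at the five companies using each keyboard design (Table 12.4).
designs = {
    "design 1": [10, 10, 8, 10, 12],
    "design 2": [24, 22, 24, 24, 26],
    "design 3": [17, 17, 15, 19, 17],
}

all_values = [x for sample in designs.values() for x in sample]
grand_mean = mean(all_values)

# Formula (12.7): total sum of squares over all n sample values.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Formula (12.3): treatment sum of squares from the sample means.
sstr = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in designs.values())

# Formula (12.5): error sum of squares from the sample variances.
sse = sum((len(s) - 1) * variance(s) for s in designs.values())

print(f"SST = {sst}, SSTR = {sstr}, SSE = {sse}")
print("SST equals SSTR + SSE:", abs(sst - (sstr + sse)) < 1e-9)
```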
Note that SST = SSTR + SSE. It is clear from this example that the computation of the various
sums of squares is rather time consuming and subject to computational errors. Computer software is
usually employed to perform these computations. This will be discussed later. If it is necessary to
perform the computations by hand, shortcut formulas are recommended. These shortcut formulas will
now be discussed. The shortcut formula for computing the total sum of squares is given in (12.8):


$$\text{SST} = \sum x^2 - \frac{(\sum x)^2}{n} \tag{12.8}$$


The treatment sum of squares is computed by using formula (12.9), where $T_i$ is the sum of the
sample values for the ith treatment.


$$\text{SSTR} = \sum \frac{T_i^2}{n_i} - \frac{(\sum x)^2}{n} \tag{12.9}$$


After SST and SSTR are computed, SSE is found by using formula (12.10):

$$\text{SSE} = \text{SST} - \text{SSTR} \tag{12.10}$$


EXAMPLE 12.6 To find the sums of squares found in Example 12.5 using the shortcut formulas, refer to the
following table.


Design 1      Design 2      Design 3
   10            24            17
   10            22            17
    8            24            15
   10            24            19
   12            26            17
sum = 50      sum = 120     sum = 85


To find SST, it is necessary first to find $\sum x$ and $\sum x^2$.

$\sum x$ = 10 + 10 + 8 + 10 + 12 + 24 + 22 + 24 + 24 + 26 + 17 + 17 + 15 + 19 + 17 = 255

$\sum x^2$ = 100 + 100 + 64 + 100 + 144 + 576 + 484 + 576 + 576 + 676 + 289 + 289 + 225 + 361 + 289 = 4849


$$\text{SST} = \sum x^2 - \frac{(\sum x)^2}{n} = 4849 - \frac{255^2}{15} = 4849 - 4335 = 514$$


The treatment sum of squares is found next.


$$\text{SSTR} = \sum \frac{T_i^2}{n_i} - \frac{(\sum x)^2}{n} = \frac{50^2}{5} + \frac{120^2}{5} + \frac{85^2}{5} - 4335 = 4825 - 4335 = 490$$


Then the error sum of squares is found by subtraction:


SSE = SST - SSTR = 514 - 490 = 24


The degrees of freedom for total is n - 1 = 14, the degrees of freedom for treatments is k - 1 = 2, and the
degrees of freedom for error is n - k = 15 - 3 = 12.
The between treatments mean square is $\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{490}{2} = 245$, and the error mean square is
$\text{MSE} = \frac{\text{SSE}}{n - k} = \frac{24}{12} = 2$. This example illustrates that it is much easier to compute the sums of squares using the
shortcut formulas than it is to use the defining formulas. Most researchers use statistical software to perform
these computations.
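
The shortcut formulas (12.8) through (12.10), followed by the two mean squares, can be sketched in Python as follows. The variable names are illustrative; only the three samples from Table 12.4 are taken from the text.

```python
samples = [
    [10, 10, 8, 10, 12],     # design 1
    [24, 22, 24, 24, 26],    # design 2
    [17, 17, 15, 19, 17],    # design 3
]

n = sum(len(s) for s in samples)     # total sample size
k = len(samples)                     # number of treatments
sum_x = sum(sum(s) for s in samples)
sum_x2 = sum(x * x for s in samples for x in s)
correction = sum_x ** 2 / n          # the term (sum of x)^2 / n

sst = sum_x2 - correction                                        # formula (12.8)
sstr = sum(sum(s) ** 2 / len(s) for s in samples) - correction   # formula (12.9)
sse = sst - sstr                                                 # formula (12.10)

mstr = sstr / (k - 1)    # between treatments mean square
mse = sse / (n - k)      # error mean square

print(f"SST = {sst}, SSTR = {sstr}, SSE = {sse}")
print(f"MSTR = {mstr}, MSE = {mse}")
```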


SAMPLING DISTRIBUTION FOR THE ONE-WAY ANOVA TEST STATISTIC 


The purpose behind a one-way ANOVA is to determine if the means of k populations are equal
or not. In particular, we are interested in testing the null hypothesis $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ against the
alternative hypothesis $H_a$: Not all k means are equal. When the k samples are randomly and
independently selected from normal populations having equal population standard deviations (or
equivalently equal population variances), and the null hypothesis is true, the ratio of the treatment
mean square to the error mean square has an F distribution with $df_1 = k - 1$ and $df_2 = n - k$. We
express this as shown in formula (12.11). This ratio, sometimes referred to as the F ratio, is the test
statistic used to test the above hypotheses.


$$F = \frac{\text{MSTR}}{\text{MSE}} \tag{12.11}$$


EXAMPLE 12.7 In Examples 12.5 and 12.6, three different keyboard designs were compared with respect to
the number of rsi found at 15 companies using the three designs. The test statistic given in formula (12.11) has
an F distribution with $df_1 = k - 1 = 3 - 1 = 2$ and $df_2 = n - k = 15 - 3 = 12$. In Example 12.6, it was shown that
MSTR = 245 and MSE = 2. The computed value of the test statistic is $F^* = \frac{245}{2} = 122.5$. From the F distribution
table in Appendix 5, the 5% and 1% right-hand tail critical values are as follows: $F_{.05}(2, 12) = 3.89$ and
$F_{.01}(2, 12) = 6.93$. The extremely large value of the test statistic suggests that the population means are not equal
and that the null hypothesis should be rejected. The differences in the sample means are significant, and indicate
that design 1 may reduce the number of rsi.
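
The decision in Example 12.7 amounts to comparing the F ratio with a table value. A brief Python sketch is given below. The critical values are the ones quoted from the F table in the example, not values computed by the code, and the names are illustrative.

```python
def f_ratio(mstr, mse):
    """Formula (12.11): ratio of the treatment mean square to the error mean square."""
    return mstr / mse

f_star = f_ratio(245, 2)

# Right-hand tail critical values for df = (2, 12), as quoted in Example 12.7.
critical_values = {0.05: 3.89, 0.01: 6.93}

# Reject the null hypothesis when the computed F ratio exceeds the critical value.
decisions = {alpha: f_star > value for alpha, value in critical_values.items()}

for alpha, reject in decisions.items():
    print(f"alpha = {alpha}: F* = {f_star}, critical value = {critical_values[alpha]}, reject H0: {reject}")
```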


BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY 
OF MEANS 


The results of the computations in the preceding sections are usually conveniently displayed in
a one-way ANOVA table. The general structure of the one-way ANOVA table is given in Table 12.6.


Table 12.6


Source       df       SS      MS = SS/df    F statistic
Treatment    k - 1    SSTR    MSTR          F = MSTR/MSE
Error        n - k    SSE     MSE
Total        n - 1    SST

EXAMPLE 12.8 The computations in Examples 12.6 and 12.7 are summarized in the ANOVA Table 12.7. 


Table 12.7 


Source       df    SS     MS = SS/df    F statistic
Treatment     2    490    245           122.5
Error        12     24      2
Total        14    514


The steps in testing the equality of means are summarized in Table 12.8.
Table 12.8
Steps for Testing the Equality of Means
Using the One-Way ANOVA Procedure


Step 1: State the null and alternative hypotheses as follows:

$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$
$H_a$: Not all k means are equal.

Step 2: Use the F distribution table and the level of significance, $\alpha$, to determine the
rejection region.

Step 3: Build the ANOVA table, and from the table determine the computed value of
the F ratio.

Step 4: State your conclusion. The null hypothesis is rejected if the computed value of
the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.


Computer software is routinely used to perform the computations needed to obtain the test
statistic used to test the equality of means. The software usually produces the ANOVA table and in
addition provides a p value for the test. When the p value is given, the null hypothesis is rejected if 
the p value is less than the preset level of significance. 


EXAMPLE 12.9 Two different Minitab commands will now be discussed which produce a one-way ANOVA 
table. The data for the number of rsi for the three keyboard designs is reproduced below. 


The Minitab command aovoneway c1 - c3 requires that the data for the three designs be put into three separate
columns. These data were put into columns c1, c2, and c3.


Row   design1   design2   design3
  1      10        24        17
  2      10        22        17
  3       8        24        15
  4      10        24        19
  5      12        26        17


MTB > aovoneway c1 - c3
One-Way Analysis of Variance 
Analysis of Variance 


Source DF SS MS F P 
Factor 2 490.00 245.00 122.50 0.000 
Error 12 24.00 2.00 

Total 14 514.00 


                                 Individual 95% CIs For Mean
                                 Based on Pooled StDev
Level      N     Mean    StDev   ----+---------+---------+---------+---
design1    5    10.00    1.414   (--*--)
design2    5    24.00    1.414                               (--*--)
design3    5    17.00    1.414                 (--*--)
                                 ----+---------+---------+---------+---
                                  10.0      15.0      20.0      25.0
The above ANOVA table is the same as the one given in Table 12.7. The Minitab output shows the p value
associated with the test under the column labeled P. Recall from our previous discussion of p value, that the p
value is the area to the right of F = 122.50 for an F distribution having $df_1 = 2$ and $df_2 = 12$. Or, stated in words,
if the null hypothesis is true, i.e., if $\mu_1 = \mu_2 = \mu_3$, the probability of obtaining an F value this large or larger is
.000 to three decimal places. For this reason, the null hypothesis would be rejected.

The second Minitab command which produces a one-way ANOVA table is oneway response in c2
treatment in c1. The treatment groups are identified in one column and the responses are given in another
column. The output is the same for both commands. The required data setup differs for the two commands.


Row   design   rsi
  1      1      10
  2      1      10
  3      1       8
  4      1      10
  5      1      12
  6      2      24
  7      2      22
  8      2      24
  9      2      24
 10      2      26
 11      3      17
 12      3      17
 13      3      15
 14      3      19
 15      3      17


MTB > oneway response in c2 treatment in c1
One-Way Analysis of Variance 

Analysis of Variance for rsi 

Source DF SS MS F P 
Design 2 490.00 245.00 122.50 0.000 
Error 12 24.00 2.00 

Total 14 514.00 


EXAMPLE 12.10 Before ending our discussion of the one-way ANOVA, it is instructive to compare the 
ANOVA tables for the two sets of data given in Tables 12.4 and 12.5 and illustrated in Figs. 12-4 and 12-5 as 
well as Figs. 12-6 and 12-7. The ANOVA for the data in Table 12.4 is given above in Example 12.9. The 
Minitab output for the ANOVA Table corresponding to the data in Table 12.5 is 


One-Way Analysis of Variance 


Analysis of Variance 

Source DF SS MS F P 
Factor 2 490.0 245.0 3.30 0.072 
Error 12 890.0 74.2 

Total 14 1380.0 


The ANOVA for the data in Table 12.5 does not indicate a difference in the three designs for $\alpha = .05$, since
p = 0.072. Thus, the statistical test for the hypotheses concerning the means confirms what Figs. 12-4 and 12-5 as
well as Figs. 12-6 and 12-7 suggested.
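
The contrast between the two data sets can also be reproduced without Minitab. The Python sketch below is a hypothetical helper, not a Minitab equivalent: it builds the sums of squares, mean squares, and F ratio for any list of samples, and it does not compute p values, which require the F distribution itself. The fourth value for design 1 in Table 12.5 is taken to be 1, as implied by the stated mean of 10.

```python
def one_way_anova(samples):
    """Return the entries of a one-way ANOVA table for a list of samples."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    grand_mean = sum(sum(s) for s in samples) / n
    # Between samples variation: treatment sum of squares.
    sstr = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
    # Within samples variation: error sum of squares.
    sse = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
    mstr, mse = sstr / (k - 1), sse / (n - k)
    return {"df": (k - 1, n - k), "SSTR": sstr, "SSE": sse,
            "MSTR": mstr, "MSE": mse, "F": mstr / mse}

table_12_4 = [[10, 10, 8, 10, 12], [24, 22, 24, 24, 26], [17, 17, 15, 19, 17]]
table_12_5 = [[10, 12, 5, 1, 22], [34, 14, 24, 19, 29], [29, 17, 5, 10, 24]]

anova_12_4 = one_way_anova(table_12_4)
anova_12_5 = one_way_anova(table_12_5)

for label, result in [("Table 12.4", anova_12_4), ("Table 12.5", anova_12_5)]:
    print(f"{label}: SSTR = {result['SSTR']:.1f}, SSE = {result['SSE']:.1f}, F = {result['F']:.2f}")
```

Both data sets have the same treatment sum of squares; only the error sum of squares, and hence the F ratio, differs.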


LOGIC BEHIND A TWO-WAY ANOVA 


Suppose that rather than considering only the effect of the factor keyboard design, we would
also like to consider the effect of the factor seating design on repetitive stress injuries (rsi). The
one-way ANOVA allows us to analyze the effect of only one factor on the response variable, rsi. A
two-way ANOVA allows one to analyze the effects of two factors on a response variable. Suppose there
are three different keyboard designs and three different seating designs we are interested in testing. A
3 x 3 factorial design is one in which each keyboard design is matched with each seating design for a
total of nine treatment combinations. Such a design allows one to test the main effects for keyboard
designs and seating designs as well as test for interaction between the two factors. Suppose 18
businesses of comparable size are selected for the study. A randomization scheme is used and each of
the nine treatments is used at two businesses. The number of rsi is recorded at each location and
the results are shown in Table 12.9.


Table 12.9 


                      Keyboard design
Seating design      1          2          3
      1           10, 12     20, 22     16, 18
      2           14, 16     25, 27     20, 22
      3            8, 8      18, 16     14, 16


EXAMPLE 12.11 Table 12.10 is a table of means for the results shown in Table 12.9. The mean for each
treatment (Keyboard-Seating design combination) is given as well as the marginal means. It is clear that
keyboard design 2 with seating design 2 resulted in the largest mean number of rsi. Also, keyboard design 1 with
seating design 3 resulted in the smallest mean number of rsi.


Table 12.10 


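                      Keyboard design
Seating design      1        2        3      Row mean
      1            11       21       17       16.33
      2            15       26       21       20.67
      3             8       17       15       13.33
 Column mean      11.33    21.33    17.67     16.78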


A graphical technique for analyzing the results of the experiment is a main effects plot. A Minitab main effects 
plot is shown in Fig. 12-8. The main effects plot for the factor, seating, is a plot of the row means given in Table 
12.10. Each of these means is calculated for six businesses. The main effects plot for the factor, keyboard, is a 
plot of the column means given in Table 12.10. Each of these means is calculated for six businesses. 
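
The treatment means and the marginal means plotted in a main effects plot can be computed directly from Table 12.9. The following Python sketch uses an illustrative nested dictionary to hold the data:

```python
from statistics import mean

# rsi[seating][keyboard] holds the two observations from Table 12.9.
rsi = {
    1: {1: [10, 12], 2: [20, 22], 3: [16, 18]},
    2: {1: [14, 16], 2: [25, 27], 3: [20, 22]},
    3: {1: [8, 8], 2: [18, 16], 3: [14, 16]},
}

# Mean for each treatment (seating design, keyboard design).
cell_means = {(s, k): mean(obs) for s, row in rsi.items() for k, obs in row.items()}

# Marginal means: each is based on six businesses.
seating_means = {s: mean([x for obs in row.values() for x in obs]) for s, row in rsi.items()}
keyboard_means = {k: mean([x for row in rsi.values() for x in row[k]]) for k in (1, 2, 3)}

largest = max(cell_means, key=cell_means.get)
smallest = min(cell_means, key=cell_means.get)

print("row (seating) means:", {s: round(m, 2) for s, m in seating_means.items()})
print("column (keyboard) means:", {k: round(m, 2) for k, m in keyboard_means.items()})
print("largest treatment mean at (seating, keyboard) =", largest)
print("smallest treatment mean at (seating, keyboard) =", smallest)
```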


Fig. 12-8 Main Effects Plot - Means for rsi
Another plot used to help explain the experimental results is called the interaction plot. A Minitab interaction
plot is shown in Fig. 12-9. The interaction plot indicates that regardless of the seating design used, keyboard
design 1 results in the smallest number of rsi, keyboard design 2 results in the largest number of rsi, and the
number of rsi for keyboard design 3 is between the other two. When the line segments are roughly parallel, as
they are here, we say that there is no interaction between the factors.


Fig. 12-9 Interaction Plot - Means for rsi


EXAMPLE 12.12 Suppose the experiment described in Example 12.11 resulted in the data shown in Table 
12.11. 


Table 12.11 
Keyboard design 
| Seatingdesign | =| | 2 


The Minitab main effects plot is shown in Fig. 12-10 and the Minitab interaction plot is shown in Fig. 12-11. An
understanding of interaction is accomplished by comparing Fig. 12-11 with Fig. 12-9. Recall that in Fig. 12-9,
we were able to conclude that regardless of the seating design used, keyboard design 1 results in the smallest
number of rsi, keyboard design 2 results in the largest number of rsi, and the number of rsi for keyboard design
3 is between the other two. Notice that this is not true in Fig. 12-11. In particular, when seating design 1 is used
the smallest mean number of rsi occurs for keyboard design 3, not keyboard design 1. In this study, the factors
keyboard design and seating design are said to interact. Note that the line segments on the interaction plot are
not parallel. The keyboard design which produces the minimum number of rsi depends on the seating design
used. We cannot conclude that keyboard design 1 always produces the minimum number of rsi. When
interaction is present, the main effects must be interpreted carefully.


The discussion and examples in this section are intended to provide some insight into the logic
behind a two-factor factorial design. In the next section the ideas will be generalized and the analysis
of variance associated with the two-factor design will be discussed. This section is intended to make
the general discussion easier to understand.


Fig. 12-10 Main Effects Plot - Means for rsi


Fig. 12-11 Interaction Plot - Means for rsi
SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM
FOR A TWO-WAY ANOVA


Let A and B be two factors whose influence on a response variable we wish to investigate. Let a
be the number of levels of factor A and let b be the number of levels of factor B. Each level of factor
A is combined with each level of factor B to form a total of ab treatments. Equal size samples are
taken from each of the ab treatments. If each sample is of size m, the total number of sample values
is n = mab.


EXAMPLE 12.13 Table 12.12 gives the compensatory awards (in thousands) for wrongful firing based on age,
race, or disability for males and females. Either bias type or gender may be called factor A. Suppose we let
factor A be bias type and factor B be gender. Then A has three levels and B has two levels. There are six
populations or treatments. The terms population, treatment, or treatment combination are used for the six groups.
We shall use the term treatment. The six treatments are: (age bias and male), (race bias and male), (disability bias
and male), (age bias and female), (race bias and female), and (disability bias and female). The values for a, b, m,
and n are 3, 2, 3, and 18, respectively.


Table 12.12
              Age              Race             Disability
Male      200, 175, 215    150, 125, 135    100, 95, 115
Female    185, 210, 225    130, 145, 115    90, 80, 110


Initially, a two factor factorial design may be analyzed using a one-way ANOVA. The number of
treatments is k = ab, and the total sample size is n. The one-way ANOVA for the two factor design is
given in Table 12.13.


Table 12.13


Source       df        SS      MS = SS/df    F statistic
Treatment    ab - 1    SSTR    MSTR          F = MSTR/MSE
Error        n - ab    SSE     MSE
Total        n - 1     SST


The treatment sum of squares, SSTR, is now partitioned into three parts as shown in formula (12.12),
where SSA is called the factor A sum of squares, SSB is called the factor B sum of squares, and SSAB is
called the interaction sum of squares.


$$\text{SSTR} = \text{SSA} + \text{SSB} + \text{SSAB} \tag{12.12}$$

The treatment degrees of freedom, ab - 1, is also partitioned into three parts as follows:

df for A = a - 1, df for B = b - 1, df for AB = (a - 1)(b - 1)


The degrees of freedom for A, B, and AB add up to the degrees of freedom for treatment as given by
formula (12.13):


$$ab - 1 = (a - 1) + (b - 1) + (a - 1)(b - 1) \tag{12.13}$$


Mean squares are also defined for A, B, and AB. The factor A mean square, MSA, is given by
formula (12.14):


$$\text{MSA} = \frac{\text{SSA}}{a - 1} \tag{12.14}$$

The factor B mean square, MSB, is given by


$$\text{MSB} = \frac{\text{SSB}}{b - 1} \tag{12.15}$$

The interaction mean square, MSAB, is given by

$$\text{MSAB} = \frac{\text{SSAB}}{(a - 1)(b - 1)} \tag{12.16}$$


The formulas for the sums of squares for A, B, and AB will not be discussed since the analysis of
factorial designs is almost always performed by computer software.


EXAMPLE 12.14 In Example 12.13, the total degrees of freedom is n - 1 = 18 - 1 = 17, the treatment degrees
of freedom is ab - 1 = 6 - 1 = 5, and the error degrees of freedom is n - ab = 18 - 6 = 12. Furthermore, the 5
degrees of freedom for treatments is partitioned into a - 1 = 3 - 1 = 2 for factor A, b - 1 = 2 - 1 = 1 for factor
B, and (a - 1)(b - 1) = 2 x 1 = 2 for interaction.
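
The bookkeeping in Example 12.14 can be checked with a small Python sketch. The function name and dictionary keys are illustrative; the internal assertions verify formula (12.13) and the split of the total degrees of freedom into treatment and error.

```python
def two_way_df(a, b, m):
    """Degrees of freedom for an a x b factorial design with m observations per treatment."""
    n = m * a * b
    df = {
        "total": n - 1,
        "treatment": a * b - 1,
        "error": n - a * b,
        "A": a - 1,
        "B": b - 1,
        "AB": (a - 1) * (b - 1),
    }
    # Formula (12.13): the A, B, and AB degrees of freedom add up to the treatment df.
    assert df["A"] + df["B"] + df["AB"] == df["treatment"]
    # The treatment and error degrees of freedom add up to the total df.
    assert df["treatment"] + df["error"] == df["total"]
    return df

# Example 12.13: three bias types, two genders, three awards per treatment.
df = two_way_df(a=3, b=2, m=3)
print(df)
```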


BUILDING TWO-WAY ANOVA TABLES 


The sum of squares, mean squares, degrees of freedom, and test statistics for a factorial design are 
summarized in a two-way ANOVA table. The structure for a two-way ANOVA table is given in 
Table 12.14. 


Table 12.14 
Source         df                SS      MS = SS/df    F statistic
Factor A       a - 1             SSA     MSA           F_A = MSA/MSE
Factor B       b - 1             SSB     MSB           F_B = MSB/MSE
Interaction    (a - 1)(b - 1)    SSAB    MSAB          F_AB = MSAB/MSE
Error          n - ab            SSE     MSE
Total          n - 1             SST


EXAMPLE 12.15 A two-way ANOVA table for the data given in Table 12.12 is given in Table 12.15. 


Table 12.15 
p= Seep SS Sais 


Bias type 
Sex 
Interaction 


Error 


SAMPLING DISTRIBUTIONS FOR THE TWO-WAY ANOVA 


When there is no interaction between factors A and B, the test statistic $F_{AB}$ has an F distribution
with $df_1 = (a - 1)(b - 1)$ and $df_2 = n - ab$. When there are no differences among the means for the
main effect A, the test statistic $F_A$ has an F distribution with $df_1 = a - 1$ and $df_2 = n - ab$. When there
are no differences among the means for the main effect B, the test statistic $F_B$ has an F distribution
with $df_1 = b - 1$ and $df_2 = n - ab$.
When there are no differences among the means for the factor A, we say there is no main effect
due to factor A. Otherwise, we say there is a main effect due to factor A. Similar terminology is used
for factor B.


EXAMPLE 12.16 In Example 12.15, the test statistic $F_A$ has an F distribution with $df_1 = 2$ and $df_2 = 12$. The
test statistic $F_B$ has an F distribution with $df_1 = 1$ and $df_2 = 12$. The test statistic $F_{AB}$ has an F distribution with
$df_1 = 2$ and $df_2 = 12$. These distributions may be used to test for significant interaction and main effects, as
discussed in the next section together with the computer analysis of the data.


TESTING HYPOTHESES CONCERNING MAIN EFFECTS AND INTERACTION


The steps for testing interaction and main effects in a two-factor factorial design are summarized
in Table 12.16.


Table 12.16 


Steps for Testing for Interaction and Main Effects
Using the Two-Way ANOVA Procedure

Step 1: State the null and alternative hypotheses for interaction and main effects as
follows:

$H_0$: There is no interaction between factors A and B
$H_a$: Factors A and B interact

$H_0$: There are no differences among the means for main effect A
$H_a$: At least two of the main effect A means differ

$H_0$: There are no differences among the means for main effect B
$H_a$: At least two of the main effect B means differ

Step 2: Use the F distribution table and the level of significance $\alpha$ to determine the
rejection regions.

Step 3: Build the ANOVA table and from the table determine the computed values of the
test statistics.

Step 4: State your conclusions. The null hypothesis is rejected if the computed value of
the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.


EXAMPLE 12.17 Table 12.9 gave the number of repetitive stress injuries (rsi) for a 3 by 3 factorial design, and
the table is reproduced below. The Minitab analysis for this experiment is also shown.


                          Keyboard design
Seating design        1          2          3
      1             10, 12     20, 22     16, 18
      2             14, 16     25, 27     20, 22
      3              8, 8      18, 16     14, 16


Two different Minitab commands will be discussed to analyze the data. The first to be discussed is the command
Twoway. First of all, note the form of the data as given below. The data in the above table must be put in the
form shown before performing the analysis. Each data line identifies the row, the column, and the data value in
that row and column.

Row seating keyboard rsi
1 1 1 10 
2 1 1 12 
3 1 2 20 
4 1 2 22 
5 1 3 16 
6 1 3 18 
7 2 1 14 
8 2 1 16 
9 2 2 25 
10 2 2 27 
11 2 3 20 
12 2 3 22 
13 3 1 8 
14 3 1 8
15 3 2 18 
16 3 2 16 
17 3 3 14 
18 3 3 16 


The command Twoway 'rsi' 'seating' 'keyboard'; starts with the keyword Twoway, followed by the response
and then the row and column names. The subcommand Means 'seating' 'keyboard'. gives row and column
means and confidence intervals.


MTB > Twoway 'rsi' 'seating' 'keyboard';
SUBC > Means 'seating' 'keyboard'.


Two-Way Analysis of Variance
Analysis of Variance for rsi


Source          DF        SS        MS
seating          2    163.11     81.56
keyboard         2    307.11    153.56
Interaction      4      4.89      1.22
Error            9     16.00      1.78
Total           17    491.11

                   Individual 95% CI
seating       Mean ------+---------+---------+---------+---------
      1      16.33                 (----*----)
      2      20.67                                   (----*----)
      3      13.33     (----*----)
                   ------+---------+---------+---------+---------
                       12.50     15.00     17.50     20.00

                   Individual 95% CI
keyboard      Mean --------+---------+---------+---------+-------
      1      11.33   (---*---)
      2      21.33                                    (---*---)
      3      17.67                        (---*---)
                   --------+---------+---------+---------+-------
                         12.00     15.00     18.00     21.00


The test statistic value for the null hypothesis $H_0$: No interaction between factors A and B is easily
computed, since MSAB = 1.22 and MSE = 1.78. The computed value for $F_{AB}^*$ is 1.22/1.78 = 0.69. The critical
value for $\alpha = .05$ is $F_{.05}(4, 9) = 3.63$. Since $F_{AB}^*$ does not exceed 3.63, the null hypothesis is not rejected.
Therefore, interaction is not significant. This confirms what the interaction plot in Fig. 12-9 indicates.

Suppose we call keyboard design factor A. The test statistic value for the null hypothesis $H_0$: There are no
differences among the means for main effect A is easily computed, since MSA = 153.56 and MSE = 1.78. The
computed value for $F_A^*$ is 153.56/1.78 = 86.27. The critical value for $\alpha = .05$ is $F_{.05}(2, 9) = 4.26$. Since $F_A^*$ far
exceeds 4.26, we reject the null hypothesis and conclude that the means for the three keyboard designs differ.

The seating design is factor B. The test statistic value for the null hypothesis $H_0$: There are no differences
among the means for main effect B is easily computed, since MSB = 81.56 and MSE = 1.78. The computed
value for $F_B^*$ is 81.56/1.78 = 45.82. The critical value for $\alpha = .05$ is $F_{.05}(2, 9) = 4.26$. Since $F_B^*$ far exceeds
4.26, we reject the null hypothesis and conclude that the means for the three seating designs differ. The main
effects plot in Fig. 12-8 indicates that seating design 3 and keyboard design 1 are a good combination for
reducing repetitive stress injuries.

A second Minitab command for doing the analysis is the command ANOVA 'rsi' = 'seating' 'keyboard'
'seating'*'keyboard'. The output below is produced by this command. This output may be preferable, since it 
provides p values. By considering the p values, we see immediately that interaction is not significant, but that 
both main effects are significant. 


MTB > ANOVA 'rsi' = 'seating' 'keyboard' 'seating'*'keyboard' 


Analysis of Variance (Balanced Designs)


Factor      Type   Levels  Values
seating     fixed       3  1  2  3
keyboard    fixed       3  1  2  3


Analysis of Variance for rsi


Source               DF        SS        MS       F      P
seating               2   163.111    81.556   45.87  0.000
keyboard              2   307.111   153.556   86.37  0.000
seating*keyboard      4     4.889     1.222    0.69  0.619
Error                 9    16.000     1.778
Total                17   491.111
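
The sums of squares in this output can be checked directly from the 18 worksheet rows. The following Python sketch applies the shortcut formulas for a balanced two-factor design with replication; it is an illustration under that assumption, not Minitab's own algorithm. The function name `two_way_anova` and the keys `row`, `col`, `interaction`, and `error` are hypothetical labels, with seating as the row factor and keyboard as the column factor.

```python
from collections import defaultdict

# (seating, keyboard, rsi) rows, as listed in the worksheet for Example 12.17
rows = [
    (1, 1, 10), (1, 1, 12), (1, 2, 20), (1, 2, 22), (1, 3, 16), (1, 3, 18),
    (2, 1, 14), (2, 1, 16), (2, 2, 25), (2, 2, 27), (2, 3, 20), (2, 3, 22),
    (3, 1, 8),  (3, 1, 8),  (3, 2, 18), (3, 2, 16), (3, 3, 14), (3, 3, 16),
]

def two_way_anova(rows):
    """Balanced two-factor ANOVA with replication; returns {source: (df, SS, MS)}."""
    n = len(rows)
    grand_total = sum(y for _, _, y in rows)
    correction = grand_total ** 2 / n            # (sum of x)^2 / n

    r_tot, c_tot, cell_tot = defaultdict(float), defaultdict(float), defaultdict(float)
    for r, c, y in rows:
        r_tot[r] += y
        c_tot[c] += y
        cell_tot[(r, c)] += y

    a, b = len(r_tot), len(c_tot)                # number of levels of each factor
    per_cell = n // (a * b)                      # replicates per treatment

    ss_total = sum(y * y for _, _, y in rows) - correction
    ss_row = sum(t * t for t in r_tot.values()) / (b * per_cell) - correction
    ss_col = sum(t * t for t in c_tot.values()) / (a * per_cell) - correction
    ss_cells = sum(t * t for t in cell_tot.values()) / per_cell - correction

    df = {"row": a - 1, "col": b - 1, "interaction": (a - 1) * (b - 1), "error": n - a * b}
    ss = {"row": ss_row, "col": ss_col,
          "interaction": ss_cells - ss_row - ss_col, "error": ss_total - ss_cells}
    return {k: (df[k], ss[k], ss[k] / df[k]) for k in df}

table = two_way_anova(rows)
mse = table["error"][2]
for source, (d, s, ms) in table.items():
    f_text = "" if source == "error" else f"  F = {ms / mse:.2f}"
    print(f"{source:12s} df = {d:2d}  SS = {s:8.3f}  MS = {ms:8.3f}{f_text}")
```

Because the sketch keeps full precision, its F ratios agree with the Minitab column rather than with the hand computation above, which divides mean squares already rounded to two decimals.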


EXAMPLE 12.18 The data from Table 12.11 are reproduced below. 


                  Keyboard design
Seating design    1    2    3


The Minitab analysis for these data is shown below.
MTB > ANOVA 'rsi' = 'seating' 'keyboard' 'seating'*'keyboard'


Analysis of Variance (Balanced Designs) 


Factor Type Levels Values 

seating fixed 3 1 2 3 

keyboard fixed 3 1 2 3 

Analysis of Variance for rsi 

Source DF SS MS F P 
seating 2 177.333 88.667 49.88 0.000 
keyboard 2 161.333 80.667 45.38 0.000 
seating*keyboard 4 65.333 16.333 9.19 0.003 
Error 9 16.000 1.778 

Total 17 420.000 


From this output, we immediately see that there is significant interaction. This confirms what the interaction plot 
shows in Fig. 12-11. It would be a good idea for you to review the discussion given in Example 12.12
concerning the nature of the interaction. Remember, when interaction is present, the main effects must be 
interpreted carefully. 


EXAMPLE 12.19 The 3 by 2 factorial data for the factors bias type and gender given in Table 12.12 are
reproduced below.

                      Bias type
Gender      1                2                3
Male        200, 175, 215    150, 125, 135    100, 95, 115
Female      185, 210, 225    130, 145, 115    90, 80, 110


The Minitab output for the Table 12.12 data is given below. 


MTB > ANOVA 'award' = 'gender' 'biastype' 'gender'*'biastype'
Analysis of Variance (Balanced Designs)

Factor      Type   Levels  Values
gender      fixed       2  1  2
biastype    fixed       3  1  2  3

Analysis of Variance for award

Source              DF        SS        MS       F      P
gender               1      22.2      22.2    0.09  0.774
biastype             2   33144.4   16572.2   64.50  0.000
gender*biastype      2     344.4     172.2    0.67  0.530
Error               12    3083.3     256.9
Total               17   36594.4


The p values indicate that there is no significant interaction. There is no significant effect due to gender.
However, the mean awards differ according to the type of bias.


Solved Problems 


F DISTRIBUTION 


12.1 What does it mean to say that the F distribution curve is asymptotic to the horizontal axis of
the rectangular coordinate system?


Ans. This means that as we move to the right from the origin along the horizontal axis, the F distribution
curve approaches the horizontal axis, but never touches it.


F TABLE 


12.2 Find the following using the F distribution in Appendix 5. 
(a) $F_{.01}(7, 3)$ (b) $F_{.05}(4, 6)$


Ans. (a) 27.67 (b) 4.53 


12.3 Find the following using the F distribution in Appendix 5.
(a) $F_{.99}(7, 3)$ (b) $F_{.95}(4, 6)$


Ans. (a) $F_{.99}(7, 3) = \frac{1}{F_{.01}(3, 7)} = \frac{1}{8.45} = 0.1183$ (b) $F_{.95}(4, 6) = \frac{1}{F_{.05}(6, 4)} = \frac{1}{6.16} = 0.1623$


12.4 Find the critical value of F for the following.
(a) df = (4, 8) and area in the right tail equal to .01.
(b) df = (6, 6) and area in the right tail equal to .05.


Ans. (a) 7.01 (b) 4.28
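
The reciprocal relation used in Problem 12.3 is easy to check numerically. In the sketch below, the helper name `lower_tail_f` is hypothetical; the two inputs are the upper-tail table values for the swapped degrees of freedom, taken from Appendix 5 as used in that answer.

```python
def lower_tail_f(upper_tail_value_swapped_df):
    """F_{1-alpha}(df1, df2) = 1 / F_{alpha}(df2, df1)."""
    return 1.0 / upper_tail_value_swapped_df

# Table values used in the answer to Problem 12.3
f_01_3_7 = 8.45   # F.01(3, 7)
f_05_6_4 = 6.16   # F.05(6, 4)

f_99_7_3 = lower_tail_f(f_01_3_7)   # F.99(7, 3)
f_95_4_6 = lower_tail_f(f_05_6_4)   # F.95(4, 6)
print(round(f_99_7_3, 4), round(f_95_4_6, 4))
```

Note that the degrees of freedom are interchanged before the table is consulted; the table itself lists only right-tail areas.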
LOGIC BEHIND A ONE-WAY ANOVA


12.5 Eighteen individuals, diagnosed as having mild high blood pressure, were randomly divided 
into three groups of six each and were assigned to one of three treatment groups. The treatment 
1 group served as a control and made no changes in their diet. The individuals in the treatment 
2 group replaced 50% of their diet with fruits and vegetables. The individuals in treatment 
group 3 replaced 50% of their diet with fruits and vegetables and reduced fat calories to 15% 
of their daily total. After six months, the change in diastolic blood pressure was measured and 
the results are given in Table 12.17. The change in blood pressure was determined by 
subtracting the reading after the six-month study from the initial reading. 


Table 12.17


Treatment 1   Treatment 2   Treatment 3
     4             6            11


A Minitab dotplot of the data given in Table 12.17 is shown in Fig 12-12. Describe in words 
what the dotplot suggests concerning the three treatments. 


Dotplots for trtment 1, trtment 2, and trtment 3 on a common horizontal axis
Fig. 12-12


Ans. The plot suggests that the means for treatments 2 and 3 are greater than the mean for treatment 1. 
It is not clear whether the means for treatments 2 and 3 are different. 


12.6 Table 12.18 is another set of data obtained for the same experiment described in Problem 12.5. 
A Minitab dotplot of the data is shown in Fig. 12-13. Describe in words what the dotplot 
suggests concerning the three treatments. 


Table 12.18 


Treatment 1 Treatment 2 Treatment 3
0 


0 


16 ys 


Dotplots for trtment 1, trtment 2, and trtment 3 on a common horizontal axis running from -6.0 to 24.0
Fig. 12-13


Ans. It is not clear whether the treatment means differ or not, since the data points overlap for the three
treatments.


SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM 
FOR A ONE-WAY ANOVA 


12.7. For the data in Table 12.17, find SST, SSTR, and SSE using both the defining and the shortcut 
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of 
freedom for total, treatments, and error. 


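Ans.


Defining formulas:

$$SSTR = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + n_3(\bar{x}_3 - \bar{x})^2 = 6(1 - 7)^2 + 6(9 - 7)^2 + 6(11 - 7)^2 = 336$$

$$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 = 5 \times 2.366^2 + 5 \times 2.098^2 + 5 \times 3.688^2 = 118$$

$$SST = SSTR + SSE = 336 + 118 = 454$$


Shortcut formulas:

$$SST = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = 1336 - 882 = 454$$

$$SSTR = \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} - \frac{(\Sigma x)^2}{n} = \frac{6^2}{6} + \frac{54^2}{6} + \frac{66^2}{6} - 882 = 336$$

where $T_i$ denotes the total of the observations for treatment $i$.

$$SSE = SST - SSTR = 454 - 336 = 118$$

Degrees of freedom: for total = n - 1 = 17, for treatments = k - 1 = 2, for error = n - k = 15

$$MSTR = \frac{SSTR}{k - 1} = \frac{336}{2} = 168$$

$$MSE = \frac{SSE}{n - k} = \frac{118}{15} = 7.87$$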
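
The defining formulas need only the sample sizes, means, and standard deviations of the three treatment groups. The Python sketch below repeats the computation for Problem 12.7 from those summary values; the function name `one_way_from_summary` is hypothetical, and the standard deviations are the rounded values quoted in the answer, so the results agree with the text only to rounding.

```python
def one_way_from_summary(sizes, means, sds):
    """SSTR, SSE, SST, MSTR, MSE and F* from group sizes, means and standard deviations."""
    n = sum(sizes)
    k = len(sizes)
    grand_mean = sum(ni * m for ni, m in zip(sizes, means)) / n
    sstr = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(sizes, means))
    sse = sum((ni - 1) * s ** 2 for ni, s in zip(sizes, sds))
    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return {"SSTR": sstr, "SSE": sse, "SST": sstr + sse,
            "MSTR": mstr, "MSE": mse, "F": mstr / mse}

# Summary values used in the answer to Problem 12.7 (Table 12.17)
result = one_way_from_summary(sizes=[6, 6, 6], means=[1, 9, 11],
                              sds=[2.366, 2.098, 3.688])
for name, value in result.items():
    print(f"{name:5s} = {value:8.2f}")
```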


12.8 For the data in Table 12.18, find SST, SSTR, and SSE using both the defining and the shortcut 
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of 
freedom for total, treatments, and error. 
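Ans. Defining formulas:

$$SSTR = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + n_3(\bar{x}_3 - \bar{x})^2 = 6(1 - 7)^2 + 6(9 - 7)^2 + 6(11 - 7)^2 = 336$$

$$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 = 5 \times 7.239^2 + 5 \times 7.483^2 + 5 \times 7.797^2 = 846$$

$$SST = SSTR + SSE = 336 + 846 = 1182$$


Shortcut formulas:

$$SST = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = 2064 - 882 = 1182$$

$$SSTR = \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} - \frac{(\Sigma x)^2}{n} = \frac{6^2}{6} + \frac{54^2}{6} + \frac{66^2}{6} - 882 = 336$$

where $T_i$ denotes the total of the observations for treatment $i$.

$$SSE = SST - SSTR = 1182 - 336 = 846$$


Degrees of freedom: for total = n - 1 = 17, for treatments = k - 1 = 2, for error = n - k = 15

$$MSTR = \frac{SSTR}{k - 1} = \frac{336}{2} = 168$$

$$MSE = \frac{SSE}{n - k} = \frac{846}{15} = 56.4$$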
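
The shortcut formulas can be checked the same way from the sum of squared observations and the treatment totals. In this sketch the function name `one_way_shortcut` is hypothetical, and the totals 6, 54, and 66 are the ones implied by the treatment means 1, 9, and 11 with six observations each.

```python
def one_way_shortcut(sum_sq, totals, sizes):
    """SST, SSTR, SSE from the sum of x^2, the treatment totals and the sample sizes."""
    n = sum(sizes)
    correction = sum(totals) ** 2 / n                      # (sum of x)^2 / n
    sst = sum_sq - correction
    sstr = sum(t ** 2 / ni for t, ni in zip(totals, sizes)) - correction
    return {"SST": sst, "SSTR": sstr, "SSE": sst - sstr}

# Problem 12.8: sum of x^2 = 2064, treatment totals 6, 54, 66 (six values each)
ss = one_way_shortcut(sum_sq=2064, totals=[6, 54, 66], sizes=[6, 6, 6])
k, n = 3, 18
mstr = ss["SSTR"] / (k - 1)
mse = ss["SSE"] / (n - k)
print(ss, round(mstr, 1), round(mse, 1), round(mstr / mse, 2))
```

Replacing 2064 by 1336 gives the Table 12.17 results, since the two data sets share the same treatment totals and differ only in their within-treatment spread.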


SAMPLING DISTRIBUTION FOR THE ONE-WAY ANOVA TEST STATISTIC 


12.9 (a) For the experiment described in problem 12.5, what conditions are necessary in order that
$F = \frac{MSTR}{MSE}$ have an F distribution with $df_1 = 2$ and $df_2 = 15$?

(b) What is the computed value of F, $F^*$, for the data in problem 12.5?

(c) What is the critical value for testing equal means at $\alpha = .05$? Would you reject the null
hypothesis at this significance level?


Ans. (a) The populations represented by the three samples are normally distributed. The three populations
have equal variances. The null hypothesis is assumed to be true.

(b) The computed value of the test statistic is $F^* = \frac{168}{7.87} = 21.35$.

(c) $F_{.05}(2, 15) = 3.68$. Reject the null hypothesis because $F^* > 3.68$.


12.10 (a) For the experiment described in problem 12.6, what conditions are necessary in order that
$F = \frac{MSTR}{MSE}$ have an F distribution with $df_1 = 2$ and $df_2 = 15$?

(b) What is the computed value of F, $F^*$, for the data in problem 12.6?

(c) What is the critical value for testing equal means at $\alpha = .05$? Would you reject the null
hypothesis at this significance level?


Ans. (a) The populations represented by the three samples are normally distributed. The three
populations have equal variances. The null hypothesis is assumed to be true.

(b) The computed value of the test statistic is $F^* = \frac{168}{56.4} = 2.98$.

(c) $F_{.05}(2, 15) = 3.68$. Do not reject the null hypothesis, since $F^*$ does not exceed 3.68.


BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY 
OF MEANS 


12.11 Refer to Problems 12.5, 12.7, and 12.9. Use the results in these three problems to build the
one-way ANOVA table for this experiment.


Ans. The one-way ANOVA table is shown below. The ANOVA table is used to systematically
summarize the lengthy computations for the procedure and to give the computed value of the test
statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for
the three treatments would be rejected at both the .05 and the .01 levels. Replacing 50% of their
diet with fruits and vegetables appears to reduce the blood pressure for individuals with mild high
blood pressure based on the results of this study.


ANOVA Table for the Data in Table 12.17

Source       df     SS     MS = SS/df      F
Treatment     2    336       168         21.35
Error        15    118         7.87
Total        17    454


12.12 Refer to problems 12.6, 12.8, and 12.10. Use the results in these three problems to build the
one-way ANOVA table for this experiment.


Ans. The one-way ANOVA table is shown below. The ANOVA table is used to systematically
summarize the lengthy computations for the procedure and to give the computed value of the test
statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for
the three treatments would not be rejected at either the .05 or the .01 level. Based on the results
of this study, we cannot claim that replacing 50% of their diets by fruits and vegetables will reduce
the blood pressures for individuals with mild high blood pressure.


ANOVA Table for the Data in Table 12.18

Source       df     SS     MS = SS/df      F
Treatment     2    336       168          2.98
Error        15    846        56.4
Total        17   1182


LOGIC BEHIND A TWO-WAY ANOVA 


12.13 A medical study utilized a 3 by 2 factorial design. One factor was the type of surgical
procedure and the other factor was the temperature of the patients' environment during
surgery. Some patients were kept warm with blankets and intravenous fluids and others were
kept cool. Four patients were available for each surgical procedure-temperature combination.
For each patient, the length of hospital stay was recorded and the results of the study are
given in Table 12.19.
Table 12.19

                  Surgical Procedure
Temperature    1              2                3
Warm           2, 3, 3, 4     5, 6, 8, 5       9, 9, 10, 12
Cool           4, 5, 5, 6     9, 10, 11, 10    13, 14, 14, 15

The mean for each treatment (Surgical Procedure-Temperature combination), as well as
marginal means, is given in Table 12.20.


Table 12.20

                  Surgical Procedure
Temperature     1      2      3     Row mean
Warm            3      6     10       6.33
Cool            5     10     14       9.67
Column mean     4      8     12       8


A Minitab main effects plot is shown in Fig. 12-14 and an interaction plot is shown in Fig.
12-15.


Main Effects Plot - Means for Days (panels: Temp, Surgical Procedure)
Fig. 12-14


Explain in words the interaction plot and the main effects plot shown in Figures 12-14 and
12-15.


Ans. The solid line segment in Fig. 12-15 corresponds to patients who were kept warm during surgery
and the dashed line corresponds to patients who were kept cool during surgery. The interaction
plot indicates that regardless of surgical procedure, the mean length of hospital stay is less when
the patient is kept warm during surgery, since the solid line is always below the dashed line. Because
the two lines are also roughly parallel, the plot suggests that there is no interaction between
temperature and the type of surgical procedure. The
main effects plot for temperature shows that the mean length of stay for patients kept warm during
surgery is less than the mean length of stay for those kept cool during surgery. The main effects
plot for surgical procedure contrasts the mean lengths of stay for the three surgical procedures.
Interaction Plot - Means for days (horizontal axis: Surgical Procedure 1, 2, 3)
Fig. 12-15


12.14 Suppose the data in the design described in Problem 12.13 are as shown in Table 12.21.


Table 12.21

                  Surgical Procedure
Temperature    1              2                3
Warm           2, 3, 3, 4     9, 10, 11, 10    9, 9, 10, 12
Cool           4, 5, 5, 6     5, 6, 8, 5       13, 14, 14, 15

The mean for each treatment (Surgical Procedure-Temperature combination), as well as
marginal means, is given in Table 12.22.


Table 12.22

                  Surgical Procedure
Temperature     1      2      3     Row mean
Warm            3     10     10       7.67
Cool            5      6     14       8.33
Column mean     4      8     12       8


Explain in words the interaction plot and the main effects plot shown in Figs. 12-16 and
12-17.
Main Effects Plot - Means for Days
Fig. 12-16


Interaction Plot - Means for days (horizontal axis: Surgical Procedure 1, 2, 3)
Fig. 12-17


Ans. The interaction plot in Fig. 12-17 illustrates a totally different situation than the interaction plot in 
Fig. 12-15. The interaction plot in Fig. 12-17 indicates that keeping the patient warm during 
surgery shortens the length of hospital stay for patients undergoing surgical procedures 1 and 3,
but lengthens the stay for those undergoing procedure 2. The response lines are not parallel and we 
say there is interaction present. This was not true in Fig. 12-15. When interaction is present, the 
main effects are more difficult to explain. 
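
The pattern described in this answer can be read off the cell means of Table 12.21. The following Python sketch computes those means and the warm-minus-cool difference for each procedure; the dictionary layout and variable names are illustrative choices only.

```python
# Table 12.21: days in hospital, keyed by (temperature, surgical procedure)
days = {
    ("warm", 1): [2, 3, 3, 4], ("warm", 2): [9, 10, 11, 10], ("warm", 3): [9, 9, 10, 12],
    ("cool", 1): [4, 5, 5, 6], ("cool", 2): [5, 6, 8, 5],    ("cool", 3): [13, 14, 14, 15],
}

def mean(values):
    return sum(values) / len(values)

cell_means = {key: mean(values) for key, values in days.items()}

# Warm-minus-cool difference for each procedure. With no interaction these
# differences would be roughly equal (parallel response lines).
differences = {p: cell_means[("warm", p)] - cell_means[("cool", p)] for p in (1, 2, 3)}

for p, d in differences.items():
    print(f"procedure {p}: warm = {cell_means[('warm', p)]:.1f}, "
          f"cool = {cell_means[('cool', p)]:.1f}, warm - cool = {d:+.1f}")
```

The difference is negative for procedures 1 and 3 and positive for procedure 2, which is the change of sign that makes the response lines cross.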
SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM
FOR A TWO-WAY ANOVA 


12.15 Refer to problem 12.13. By initially considering the two-way design as a one-way design,
give the degrees of freedom for total, treatments, and error. Then, break the degrees of
freedom for treatments down into degrees of freedom for temperature, surgical procedure, and
temperature-surgical procedure interaction.


Ans. As a one-way design, the total degrees of freedom is n - 1 = 24 - 1 = 23. The degrees of freedom
for treatments is ab - 1 = 2 x 3 - 1 = 5. The degrees of freedom for error is n - ab = 24 - 6 = 18.


The 5 degrees of freedom for treatments is partitioned into a - 1 = 2 - 1 = 1 degree of freedom for
temperature, b - 1 = 3 - 1 = 2 degrees of freedom for surgical procedures, and (a - 1)(b - 1) =
1 x 2 = 2 degrees of freedom for interaction.


12.16 Sixty plots were used in an agricultural experiment. Four varieties of wheat and five different
levels of fertilizer were used in a 4 x 5 factorial experiment. Each variety of wheat and each
fertilizer level combination was used on three randomly chosen plots. List the 20 treatments,
give the complete breakdown for the total degrees of freedom, and express the total sum of
squares as a sum of four separate sums of squares.


Ans. One treatment would be the first variety of wheat combined with the first level of fertilizer,
represented as (V1, F1). The other 19 treatments are: (V1, F2), (V1, F3), (V1, F4), (V1, F5), (V2,
F1), (V2, F2), (V2, F3), (V2, F4), (V2, F5), (V3, F1), (V3, F2), (V3, F3), (V3, F4), (V3, F5), (V4,
F1), (V4, F2), (V4, F3), (V4, F4), and (V4, F5).


The total degrees of freedom is n - 1 = 60 - 1 = 59. Let factor A be variety of wheat and let factor
B be fertilizer level. The degrees of freedom for A is a - 1 = 4 - 1 = 3, the degrees of freedom for
B is b - 1 = 5 - 1 = 4, and the degrees of freedom for interaction is (a - 1) x (b - 1) = 3 x 4 = 12. The
degrees of freedom for error is n - ab = 60 - 20 = 40. Note that 59 = 3 + 4 + 12 + 40.


SST = SSA + SSB + SSAB + SSE.
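
The bookkeeping in Problems 12.15 and 12.16 follows one pattern, sketched below in Python. The function name `factorial_df` is hypothetical; `a` and `b` are the numbers of levels of the two factors and `n` is the total number of observations.

```python
from itertools import product

def factorial_df(a, b, n):
    """Degrees of freedom for an a x b factorial with n observations in total."""
    return {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1),
            "error": n - a * b, "total": n - 1}

# Problem 12.16: 4 wheat varieties x 5 fertilizer levels on 60 plots
treatments = [f"(V{v}, F{f})" for v, f in product(range(1, 5), range(1, 6))]
df = factorial_df(a=4, b=5, n=60)

print(len(treatments), treatments[:3])
print(df)
# The four component df add up to the total df
assert df["A"] + df["B"] + df["AB"] + df["error"] == df["total"]
```

Calling `factorial_df(2, 3, 24)` gives the breakdown for the surgical study in Problem 12.15.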


BUILDING TWO-WAY ANOVA TABLES 


12.17 Give the general structure for the two-way ANOVA table for the data in Table 12.19. 


Ans. The results are given in Table 12.23. 


Table 12.23

Source         df    SS      MS                F Statistic
Procedure       2    SSA     MSA = SSA/2       MSA/MSE
Temperature     1    SSB     MSB = SSB/1       MSB/MSE
Interaction     2    SSAB    MSAB = SSAB/2     MSAB/MSE
Error          18    SSE     MSE = SSE/18


12.18 Give the general structure for the two-way ANOVA table for the experiment described in
problem 12.16.


Ans. The results are given in Table 12.24.


Table 12.24

Source              df    SS      MS                 F Statistic
Variety              3    SSA     MSA = SSA/3        MSA/MSE
Fertilizer level     4    SSB     MSB = SSB/4        MSB/MSE
Interaction         12    SSAB    MSAB = SSAB/12     MSAB/MSE
Error               40    SSE     MSE = SSE/40


SAMPLING DISTRIBUTIONS FOR THE TWO-WAY ANOVA 


12.19 Find the critical values for $F_A$, $F_B$, and $F_{AB}$ in problem 12.17. Use $\alpha = .05$.

Ans. Factor A is the surgical procedure and factor B is the temperature of the patients' environment.
The test statistic $F_A$ has an F distribution with $df_1 = 2$ and $df_2 = 18$. The critical value for $F_A$ is
3.55. The test statistic $F_B$ has an F distribution with $df_1 = 1$ and $df_2 = 18$. The critical value for $F_B$
is 4.41. The test statistic $F_{AB}$ has an F distribution with $df_1 = 2$ and $df_2 = 18$ and the critical value
for $F_{AB}$ is 3.55.


12.20 Find the critical values for $F_A$, $F_B$, and $F_{AB}$ in problem 12.18. Use $\alpha = .01$.


Ans. Factor A is the variety of wheat and factor B is the level of fertilizer. The test statistic $F_A$ has an F
distribution with $df_1 = 3$ and $df_2 = 40$. The critical value for $F_A$ is 4.31. The test statistic $F_B$ has an F
distribution with $df_1 = 4$ and $df_2 = 40$. The critical value for $F_B$ is 3.83. The test statistic $F_{AB}$ has an
F distribution with $df_1 = 12$ and $df_2 = 40$ and the critical value for $F_{AB}$ is 2.66.


TESTING HYPOTHESIS CONCERNING MAIN EFFECTS AND INTERACTION 


12.21 The Minitab output for the data given in Table 12.19 is shown below. After reviewing
Problems 12.13, 12.15, 12.17, and 12.19, as well as the Minitab output, test for main effects
as well as interaction at $\alpha = .05$.


Analysis of Variance for days


Source            DF        SS        MS        F      P
Temp               1    66.667    66.667    60.00  0.000
Surgproc           2   256.000   128.000   115.20  0.000
Temp*Surgproc      2     5.333     2.667     2.40  0.119
Error             18    20.000     1.111
Total             23   348.000


Ans. The Minitab output confirms the results discussed in problems 12.13, 12.15, 12.17, and 12.19. The
p value for interaction, 0.119, is greater than $\alpha$ and interaction is not significant. This makes the
interpretation of main effects easier. The p values for temperature and surgical procedure are both
0.000. The mean length of hospital stay in days differs for the three surgical procedures and for the
two operation temperatures at which the patients are kept.


12.22 The Minitab output for the data given in Table 12.21 is shown below. After reviewing
Problem 12.14, as well as the Minitab output, test for main effects as well as interaction at
$\alpha = .05$.


Analysis of Variance for days


Source            DF        SS        MS        F      P
Temp               1     2.667     2.667     2.40  0.139
Surgproc           2   256.000   128.000   115.20  0.000
Temp*Surgproc      2    69.333    34.667    31.20  0.000
Error             18    20.000     1.111
Total             23   348.000


Ans. Since the interaction is significant, i.e., the p value = 0.000 < $\alpha$, we explain the nature of the
interaction rather than make broad generalizations about the main effects. In Fig. 12-17, the solid
line segment corresponds to the patients who were kept warm during surgery and the line segment
made up of dashes corresponds to patients who were kept cool during surgery. The interaction plot
suggests that the hospital stay is shorter if the patient is kept warm during surgery for procedures 1
and 3. However, it appears that for procedure 2, keeping the patient warm during surgery increases
the hospital stay.
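
Every MS and F entry in a table of this kind follows from the SS and DF columns, which is a convenient check when an entry is hard to read. The Python sketch below rebuilds those columns for Problem 12.22; the function name `complete_anova` and the dictionary layout are hypothetical, and the p values are not reproduced because they require the F distribution itself.

```python
def complete_anova(ss_df, error_key="Error"):
    """Fill in MS = SS/df for every row and F = MS/MSE for every non-error row."""
    mse = ss_df[error_key][0] / ss_df[error_key][1]
    table = {}
    for source, (ss, df) in ss_df.items():
        ms = ss / df
        f = None if source == error_key else ms / mse
        table[source] = {"df": df, "SS": ss, "MS": ms, "F": f}
    return table

# (SS, df) pairs from the Minitab output in Problem 12.22
problem_12_22 = {
    "Temp": (2.667, 1),
    "Surgproc": (256.000, 2),
    "Temp*Surgproc": (69.333, 2),
    "Error": (20.000, 18),
}

table = complete_anova(problem_12_22)
for source, row in table.items():
    f_text = "" if row["F"] is None else f"{row['F']:8.2f}"
    print(f"{source:14s} {row['df']:3d} {row['SS']:9.3f} {row['MS']:9.3f} {f_text}")
```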


Supplementary Problems 


F DISTRIBUTION 


12.23 Fill in the following blanks with the appropriate distribution. Choose from the words standard normal,
student t, Chi-square, and F.

(a) The ____ distribution is symmetrical about zero and has standard deviation equal to one.

(b) The ____ distribution is skewed to the right. The shape of the distribution curve is
determined by the number of degrees of freedom.

(c) The ____ distribution is symmetrical about zero. The shape of the distribution curve is
determined by the number of degrees of freedom.

(d) The shape of the ____ distribution is determined by two separate degrees of freedom.


Ans. (a) standard normal (b) Chi-square (c) student t (d) F


F TABLE 


12.24 Find the following using the F distribution in Appendix 5.
(a) $F_{.01}(2, 8)$ (b) $F_{.05}(2, 8)$


Ans. (a) 8.65 (b) 4.46


12.25 Find the following using the F distribution in Appendix 5.
(a) $F_{.99}(8, 2)$ (b) $F_{.95}(8, 2)$


Ans. (a) 0.1156 (b) 0.2242


12.26 The random variable F has an F distribution with $df_1 = 5$ and $df_2 = 5$. Find two values a and b such
that P(a < F < b) = .90.


Ans. a = 0.1980 and b = 5.05
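
The answer to Problem 12.26 can be checked by computing the area under the F curve between a and b. The Python sketch below does this with the standard F density and Simpson's rule, using only the standard library; the function names `f_pdf` and `simpson` are hypothetical, and the result is a numerical approximation rather than a table value.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, x > 0."""
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = ((d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2) - log_beta)
    return math.exp(log_pdf)

def simpson(func, lo, hi, steps=2000):
    """Composite Simpson's rule with an even number of steps."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3

a, b = 0.1980, 5.05          # answer to Problem 12.26
prob = simpson(lambda x: f_pdf(x, 5, 5), a, b)
print(round(prob, 3))        # area between a and b under the F(5, 5) curve
```

The area should come out close to .90, with about .05 in each tail, since a is the reciprocal of b when the two degrees of freedom are equal.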
LOGIC BEHIND A ONE-WAY ANOVA


12.27. Thirty individuals were randomly divided into 3 groups of 10 each. Each member of one group 
completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the 
questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater the 
satisfaction. A second set of CSI scores were obtained for garbage collection from the second group, 
and a third set of scores were obtained for long distance telephone service. The scores are given in Table 
12.25. A Minitab boxplot for the data in Table 12.25 is shown in Fig. 12-18. Describe what the boxplot 
suggests concerning the mean CSI scores for the IRS, garbage collection, and long distance phone 
service. 


Table 12.25 


IRS   Garbage collection   Long distance service


Boxplots of CSI scores for IRS, Garbage Collection, and Long Distance Service on a common CSI axis running from 40.0 to 80.0
Fig. 12-18


Ans. Even though the whiskers overlap for the three different groups, the boxes that contain the middle 
50% of the data do not overlap for the three groups. Note that the plus signs are above the median
values and are fairly spread apart. The data suggest that the mean CSI scores for the three 
populations are probably different. An analysis of variance may be performed to test the null 
hypothesis of equal population means. 


12.28 Suppose the survey described in problem 12.27 resulted in the data given in Table 12.26. Describe what
the boxplot in Fig. 12-19 suggests concerning the mean CSI scores for the IRS, garbage collection, and 
long distance phone service. 
Table 12.26


Boxplots of CSI scores for IRS, Garbage Collection, and Long Distance Service on a common CSI axis running from 15 to 90
Fig. 12-19


Ans. The boxplots for the three sets of CSI scores shown in Fig. 12-19 exhibit a considerable amount of
overlap. This is the type of result we might expect if there is no difference in the three population
means. The boxplots indicate that the assumption that the three populations have normal
distributions with equal variances should be checked.


SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM FOR A ONE-WAY ANOVA 


12.29 For the data in Table 12.25, find SST, SSTR, and SSE using both the defining and the shortcut 
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for 
total, treatments, and error. 


Ans. SSTR = 2761.7 SSE = 1035.0 SST = 3796.7 MSTR = 1380.8 MSE = 38.3
Degrees of freedom: for total = 29, for treatments = 2, and for error = 27. 


12.30 For the data in Table 12.26, find SST, SSTR, and SSE using both the defining and the shortcut 
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for 
total, treatments, and error. 


Ans. SSTR = 2761.7 SSE = 12085 SST = 14847 MSTR = 1380.8 MSE = 448
Degrees of freedom: for total = 29, for treatments = 2, and for error = 27. 
SAMPLING DISTRIBUTION FOR THE ONE-WAY ANOVA TEST STATISTIC


12.31 Refer to problems 12.27 and 12.29. If the populations of CSI responses concerning the IRS, garbage
collection, and long distance phone service are normally distributed, have equal variances and equal
means, what is the distribution of $F = \frac{MSTR}{MSE}$? What is the $\alpha = .05$ critical value for testing the null
hypothesis that the population means are equal? Give the computed value of the test statistic and your
conclusion.


Ans. The test statistic $F = \frac{MSTR}{MSE}$ has an F distribution with $df_1 = 2$ and $df_2 = 27$. The critical value,
$F_{.05}(2, 27)$, is between 3.39 and 3.32. The computed value of the test statistic is $F^* = 36.05$. The
null hypothesis is rejected.


12.32 Refer to problems 12.28 and 12.30. If the populations of CSI responses concerning the IRS, garbage
collection, and long distance phone service are normally distributed, have equal variances and equal
means, what is the distribution of $F = \frac{MSTR}{MSE}$? What is the $\alpha = .05$ critical value for testing the null
hypothesis that the population means are equal? Give the computed value of the test statistic and your
conclusion.


Ans. The test statistic $F = \frac{MSTR}{MSE}$ has an F distribution with $df_1 = 2$ and $df_2 = 27$. The critical value,
$F_{.05}(2, 27)$, is between 3.39 and 3.32. The computed value of the test statistic is $F^* = 3.08$. The null
hypothesis is not rejected.


BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY OF MEANS 


12.33 Table 12.27 gives the cost in thousands of dollars for randomly selected weddings with approximately
100 guests for three geographical regions in the U.S. Build a one-way ANOVA and test the null
hypothesis that the mean costs for such weddings do not differ for the three regions. Test at a 5% level
of significance.


Table 12.27 


mean = 16.85 mean = 19.05 


Ans. The Minitab output for the above data is as follows:


Analysis of Variance


Source    DF        SS       MS      F      P
Factor     2     64.55    32.27   6.26  0.007
Error     21    108.20     5.15
Total     23    172.75

The p value is less than the preset alpha. The mean costs differ for the three regions. Such
weddings appear to cost less in the South than in the North or West on the average.


12.34 A sociological study compared the time spent watching TV per week for children ages 2 to 11 for the 
years 1994, 1995, 1996, and 1997. The times in hours per week are shown in Table 12.28. Is there a 
difference in the means for the four years? Test at a 5% level of significance. 


Table 12.28 


Ans. The Minitab output for the one-way ANOVA is shown below. 


Analysis of Variance 


Source DF SS MS F P 
Factor 3 30.1 10.0 0.95 0.421 
Error 76 802.7 10.6 

Total 79 832.8 


The sample means seem to have decreased since 1994. However, the p value = .421 exceeds .05, so the
differences among the means are not statistically significant.


LOGIC BEHIND A TWO-WAY ANOVA 


12.35 Table 12.29 gives the salaries in thousands of dollars for 20 individuals. Half are Nurse Practitioners 
and half are Physician Assistants. In addition, they are classified as practicing in a rural or an urban 
setting. 

Table 12.29

           Physician Assistant     Nurse Practitioner
Urban      45, 51, 52, 48, 54      42, 44, 47, 49, 50
Rural      46, 50, 50, 50, 52      45, 48, 49, 47, 48

The interaction plots for the data given in Table 12.29 are given in Figs. 12-20 and 12-21. Explain the
plots in words.


Interaction Plot - Means for salary (horizontal axis: PA/NP)
Fig. 12-20


Interaction Plot - Means for salary
Fig. 12-21


Ans. The solid line in Fig. 12-20 corresponds to urban health care workers. The solid line shows that 
urban Physician Assistants in the study had a greater mean salary than urban Nurse Practitioners. 
The dashed line corresponds to rural health care workers. This line shows that the mean salary for 
rural Physician Assistants in the study was also greater than the mean for Nurse Practitioners. The 
solid line in Fig. 12-21 corresponds to Physician Assistants and the dashed line corresponds to 
Nurse Practitioners. This Figure shows that the mean for urban Physician Assistants exceeds the 
mean for rural Physician Assistants. The dashed line shows the opposite to be true for Nurse 
Practitioners. 


12.36 Table 12.30 gives the results for a 2 x 2 factorial design. The yield in kilograms of okra per plot is
given for 28 different plots. Each Fertilizer-Moisture combination was applied to 7 different plots and
the total yield was recorded for each plot. The interaction plot for the data in Table 12.30 is shown in
Fig. 12-22. Explain, in words, the nature of the interaction.


Table 12.30

                  Low Fertilizer            High Fertilizer
Low Moisture      2, 3, 4, 5, 6, 4, 4       8, 8, 8, 9, 10, 6, 10
High Moisture     9, 9, 11, 8, 8, 9, 8      1, 2, 2, 1, 2, 3, 3


[Interaction plot of the mean yields, plotted against Fertilizer level (1, 2), with one line for each moisture level.]

Fig. 12-22


Ans. The solid line corresponds to the low level of moisture. This line indicates that at the low level of 
moisture, the mean yield is increased when the level of fertilizer is increased. The dashed line 
corresponds to the high level of moisture. This line indicates that at the high level of moisture, the 
yield is decreased when the level of fertilizer is increased. Whether or not this interaction is 
significant is determined by performing an analysis of variance. 


SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM FOR A TWO-WAY ANOVA 


12.37 In problem 12.35, let factor A be the type of health professional and let the two levels of factor A be 
Physician Assistant and Nurse Practitioner. Let factor B be the population setting and let the two levels 
of factor B be urban and rural. Give the degrees of freedom for the following sources of variation: total, 
A, B, AB, and error. 


Ans. The degrees of freedom for the sources are: total df = 19, A df = 1, B df = 1, AB df = 1, error df =
16.


12.38 In problem 12.36, let factor A be fertilizer level and let the two levels of factor A be low and high. Let 
factor B be the moisture level and let the two levels of factor B be low and high. Give the degrees of 
freedom for the following sources of variation: total, A, B, AB, and error. 


Ans. The degrees of freedom for the sources are: total df = 27, A df = 1, B df = 1, AB df = 1, error df =
24.
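
The degrees of freedom in problems 12.37 and 12.38 follow a fixed pattern for a balanced two-factor design with replication. The short Python sketch below is only an illustration; the function name `two_way_anova_df` and its argument names are hypothetical and are not part of the outline or of Minitab.

```python
def two_way_anova_df(a_levels, b_levels, n_per_cell):
    """Degrees of freedom for a balanced two-factor design with replication."""
    df_a = a_levels - 1                                # factor A
    df_b = b_levels - 1                                # factor B
    df_ab = df_a * df_b                                # AB interaction
    df_error = a_levels * b_levels * (n_per_cell - 1)  # within-cell variation
    df_total = a_levels * b_levels * n_per_cell - 1    # all observations minus 1
    return {"total": df_total, "A": df_a, "B": df_b, "AB": df_ab, "error": df_error}


# Problem 12.37: 2 x 2 design with 5 salaries per cell.
print(two_way_anova_df(2, 2, 5))
# Problem 12.38: 2 x 2 design with 7 plots per cell.
print(two_way_anova_df(2, 2, 7))
```

In both cases the A, B, AB, and error degrees of freedom add up to the total degrees of freedom.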


BUILDING TWO-WAY ANOVA TABLES 


12.39 Give the general structure for the two-way ANOVA table for the data in Table 12.29. Name the factors
and levels as given in problem 12.37.


Ans. The results are given in Table 12.31.


Table 12.31 


12.40 Give the general structure for the two-way ANOVA table for the data in Table 12.30. Name the factors 
and levels as given in problem 12.38. 


Ans. The results are given in Table 12.32. 


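Table 12.32

Source    DF    SS      MS                F Statistic
A          1    SSA     MSA = SSA/1       FA = MSA/MSE
B          1    SSB     MSB = SSB/1       FB = MSB/MSE
AB         1    SSAB    MSAB = SSAB/1     FAB = MSAB/MSE
Error     24    SSE     MSE = SSE/24
Total     27    SST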


SAMPLING DISTRIBUTIONS FOR THE TWO-WAY ANOVA 

12.41 Give the critical values for the test statistics $F_A$, $F_B$, and $F_{AB}$ in problem 12.39. Use $\alpha = .05$.

Ans. The critical value for all three is 4.49.

12.42 Give the critical values for the test statistics $F_A$, $F_B$, and $F_{AB}$ in problem 12.40. Use $\alpha = .05$.


Ans. The critical value for all three is 4.26. 


TESTING HYPOTHESIS CONCERNING MAIN EFFECTS AND INTERACTION 


12.43 The Minitab output for the data given in Table 12.29 is shown below. After reviewing problems 12.35,
12.37, 12.39, and 12.41, as well as the Minitab output below, test for main effects as well as interaction
at $\alpha = .05$.

Analysis of Variance for salary

Source            DF        SS        MS       F       P
Urban/ru           1     0.450     0.450    0.06   0.812
PA/NP              1    42.050    42.050    5.44   0.033
Urban/ru*PA/NP     1     2.450     2.450    0.32   0.581
Error             16   123.600     7.725
Total             19   168.550

Ans. The p values indicate that at the $\alpha = .05$ level of significance, the interaction is not significant
(p = 0.581). There is no significant difference in the mean salaries between urban and rural settings
(p = 0.812). However, the mean salary for Physician Assistants is significantly greater than the mean
salary for Nurse Practitioners (p = 0.033).


12.44 The Minitab output for the data given in Table 12.30 is shown below. After reviewing problems 12.36,
12.38, 12.40, and 12.42, as well as the Minitab output below, test for main effects as well as interaction
at $\alpha = .05$.


Analysis of Variance for yield 


Source DF SS MS F P 
Moisture 1 4.321 4.321 3.18 0.087 
Fertlzer 1 10.321 10.321 7.61 0.011 
Moisture*Fertlizer 1 222.893 222.893 164.24 0.000 
Error 24 32.571 1.357 

Total 27 270.107 


Ans. Because of the significant moisture-fertilizer interaction, the main effects must be interpreted in 
the presence of this interaction. 
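
The sums of squares and F statistics that Minitab reports in problems 12.43 and 12.44 can be reproduced by hand for a balanced design. The Python sketch below does this for the salary data of Table 12.29, with factor A as the type of health professional and factor B as the setting, as in problem 12.37. It assumes the first column of Table 12.29 holds the Physician Assistant salaries and the first row the urban salaries; the names `cells`, `mean`, and `two_way_anova` are hypothetical.

```python
# Salaries (thousands of dollars) from Table 12.29, keyed by (profession, setting).
# Assumption: first column = Physician Assistants, first row = urban.
cells = {
    ("PA", "urban"): [45, 51, 52, 48, 54],
    ("NP", "urban"): [42, 44, 47, 49, 50],
    ("PA", "rural"): [46, 50, 50, 50, 52],
    ("NP", "rural"): [45, 48, 49, 47, 48],
}


def mean(values):
    return sum(values) / len(values)


def two_way_anova(cells):
    """Sums of squares and F statistics for a balanced two-factor design."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # observations per cell
    grand = mean([v for obs in cells.values() for v in obs])

    def level_mean(index, level):
        # Mean of all observations at one level of factor A (index 0) or B (index 1).
        return mean([v for key, obs in cells.items() if key[index] == level for v in obs])

    ss_a = n * len(b_levels) * sum((level_mean(0, a) - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((level_mean(1, b) - grand) ** 2 for b in b_levels)
    ss_cells = n * sum((mean(obs) - grand) ** 2 for obs in cells.values())
    ss_ab = ss_cells - ss_a - ss_b  # interaction = between-cell variation not due to A or B
    ss_error = sum((v - mean(obs)) ** 2 for obs in cells.values() for v in obs)

    df = {"A": len(a_levels) - 1, "B": len(b_levels) - 1}
    df["AB"] = df["A"] * df["B"]
    ms_error = ss_error / (len(a_levels) * len(b_levels) * (n - 1))
    ss = {"A": ss_a, "B": ss_b, "AB": ss_ab}
    f = {source: (ss[source] / df[source]) / ms_error for source in ss}
    return ss, ss_error, ms_error, f


ss, ss_error, ms_error, f = two_way_anova(cells)
for source in ("A", "B", "AB"):
    print(f"{source:5s} SS = {ss[source]:7.3f}  F = {f[source]:5.2f}")
print(f"Error SS = {ss_error:7.3f}  MSE = {ms_error:.3f}")
```

The p values are not computed here; each F statistic would be compared with the critical value 4.49 from problem 12.41.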


Chapter 13 


Regression and Correlation 


STRAIGHT LINES 


Suppose eight individuals employed at large companies were interviewed, and the number of years
of service, x, and the number of annual paid days off, y, were determined for each. The results are
shown in Table 13.1.


Table 13.1 


A scatter plot for these data is shown in Fig. 13-1. 
Fig. 13-1


The data plotted in Fig. 13-1 have a very special property. The points fall perfectly on a straight
line. The equation of the line is y = 11.0 + 0.5x. The number 11.0 is called the y intercept. This is the
value of y when x = 0. The number 0.5 is called the slope of the line. The equation may also be
expressed as y = 0.5x + 11.0. When straight lines are studied in algebra, the slope-intercept form of a
line is given as y = mx + b. When lines are discussed in statistics, the equation is often written as
$y = \beta_0 + \beta_1 x$. In this form, $\beta_0$ is the y intercept and $\beta_1$ is the slope. Figure 13-2 shows the line y = 11.0
+ 0.5x and the points from Table 13.1 on the same graph.

[Plot of the line y = 11.0 + 0.5x with the points from Table 13.1; horizontal axis: Years, vertical axis: Days off.]
Fig. 13-2


If the relationship between years of service and annual paid days off for all employees of large
companies were described by the equation y = 11.0 + 0.5x, then we could predict perfectly the
number of annual paid days off if we knew the number of years of service. Assuming the equation
described the relationship perfectly, a person with x = 14 years of service would receive y = 11.0 +
0.5(14) = 18 annual paid days off.


EXAMPLE 13.1 The line y = 1.7 - 4.5x has y intercept 1.7 and slope -4.5. When x = 2, the corresponding y
value is y = 1.7 - 4.5(2) = -7.3. That is, the point (2, -7.3) falls on the line whose equation is y = 1.7 - 4.5x.


LINEAR REGRESSION MODEL 


Table 13.2 contains a more realistic data set than that shown in Table 13.1. In Table 13.2, x 
represents the number of years of service, and y represents the annual paid days off. These data were 
obtained from 20 individuals at a large company. Figure 13-3 is a plot of the data. These points 
clearly do not fall along a straight line. However, the data do indicate a linear trend. That is, a line 
could be fit to the points such that the variation of the points about the line is small. A line is shown 
which provides a good fit to the data. 

A linear regression model assumes that some dependent or response variable, represented by y,
is related to an independent variable, represented by x, by the relationship shown in formula (13.1).


$$y = \beta_0 + \beta_1 x + e \tag{13.1}$$


The error term, e, is a normally distributed random variable with mean equal to 0 and standard
deviation equal to $\sigma$. Each value of x determines a population of y values.

Table 13.2

Fig. 13-3


The linear regression model, given in formula (13.1), describes the relationship between y and x
for some population. The intercept, $\beta_0$, and the slope, $\beta_1$, are unknown parameters and must be
estimated from sample data. The next section discusses the technique used to estimate $\beta_0$ and $\beta_1$.
Table 13.3 contains a summary of the basic properties of the linear regression model.


Table 13.3

1. The regression model is $y = \beta_0 + \beta_1 x + e$, where $\beta_0$ is the y intercept of the line
   given by $\beta_0 + \beta_1 x$, $\beta_1$ is the slope of the line given by $\beta_0 + \beta_1 x$, and e is the error
   or deviation of the actual y value from the line given by $\beta_0 + \beta_1 x$.

2. The error term e is a random variable with a mean of 0, i.e., E(e) = 0.

3. The expected value of y is $E(y) = \beta_0 + \beta_1 x$.

4. The variance of y equals $\sigma^2$ and is the same for all values of x.

5. The values of e are independent.

6. The error term e is normally distributed.


EXAMPLE 13.2 Consider the linear regression model y = 2.5 + 3x + e, where the normal random variable e
has mean equal to 0 and standard deviation equal to 1.2. The mean response for y when x = 2 is E(y) = 2.5 +
3(2) = 8.5 and the mean response for y when x = 4 is E(y) = 2.5 + 3(4) = 14.5. In this example we are assuming
that we know the values for $\beta_0$ and $\beta_1$. However, this is not usually the case. We usually have to estimate these
two parameters from sample data taken from the population. 
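
The model of Example 13.2 can be explored with a short simulation. The Python sketch below is only an illustration under the stated model (intercept 2.5, slope 3, error standard deviation 1.2); the function names `mean_response` and `simulate_y`, the seed, and the number of draws are hypothetical choices.

```python
import random


def mean_response(beta0, beta1, x):
    """Expected value of y at x: E(y) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


def simulate_y(beta0, beta1, sigma, x, n_draws, seed=0):
    """Draw y = beta0 + beta1*x + e, with e normal with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n_draws)]


print(mean_response(2.5, 3.0, 2))   # mean response at x = 2
print(mean_response(2.5, 3.0, 4))   # mean response at x = 4

# The average of many simulated y values at x = 2 should be close to E(y) at x = 2.
draws = simulate_y(2.5, 3.0, 1.2, x=2, n_draws=20000)
print(sum(draws) / len(draws))
```

Each value of x defines its own population of y values; the simulation samples from the population at x = 2.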


LEAST SQUARES LINE 


The line shown in Fig. 13-3 is called the least-squares line. The least-squares line is determined
such that the sum of the squares of the deviations of the data points from the line is a minimum. For
point $(x_i, y_i)$, the deviation from the fitted line, whose equation is represented as $\hat{y} = b_0 + b_1 x$, is
given by $d_i = y_i - (b_0 + b_1 x_i)$. The estimates $b_0$ and $b_1$ are determined so that the sum of squares of
deviations about the line is minimized. The sum of squares of deviations is given by

$$D = \sum d_i^2 = \sum [y_i - (b_0 + b_1 x_i)]^2 \tag{13.2}$$

The estimated y intercept, $b_0$, is also represented by the symbols $a$ and $\hat{\beta}_0$, and the estimated slope is
represented by $b$ and $\hat{\beta}_1$. We shall use $b_0$ and $b_1$ as the symbols for the estimates of $\beta_0$ and $\beta_1$. Using
calculus, it may be shown that the estimate for $\beta_1$ is given by

$$b_1 = \frac{S_{xy}}{S_{xx}} \tag{13.3}$$

where $S_{xx}$ is given by

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \tag{13.4}$$

and $S_{xy}$ is given by

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \tag{13.5}$$

The estimate for $\beta_0$ is given by

$$b_0 = \bar{y} - b_1 \bar{x} \tag{13.6}$$


EXAMPLE 13.3 The data from Table 13.2 are reproduced below. Recall that x represents the number of years
of service and y represents the annual paid days off.

For these data, $\sum xy = 2 \times 10 + 2 \times 14 + 2 \times 12 + \cdots + 30 \times 27 + 30 \times 25 + 30 \times 26 = 6600$.

$\sum x = 2 + 2 + 2 + \cdots + 30 + 30 + 30 = 304$, and $\bar{x} = 304/20 = 15.2$.

$\sum y = 10 + 14 + 12 + \cdots + 27 + 25 + 26 = 372$, and $\bar{y} = 372/20 = 18.6$.

$\sum x^2 = 4 + 4 + 4 + \cdots + 900 + 900 + 900 = 6512$, $\sum y^2 = 100 + 196 + 144 + \cdots + 729 + 625 + 676 = 7426$.

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 6512 - 4620.8 = 1891.2, \qquad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 6600 - 5654.4 = 945.6$$

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{945.6}{1891.2} = 0.5 \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x} = 18.6 - 0.5 \times 15.2 = 11.0$$


The equation of the line is $\hat{y} = b_0 + b_1 x = 11.0 + 0.5x$. The fitted line is referred to by several
different names. The fitted line is called the line of best fit, the least-squares line, the estimated
regression line, and the prediction line. The computations for $b_0$ and $b_1$ are rarely performed by hand
in practice. Computer software is normally used to find the equation. This is illustrated in Example 
13.4. 
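
The hand computation of Example 13.3 uses only the five sums and the sample size, so it can be sketched in a few lines of Python. This is an illustration of formulas (13.3) through (13.6), not the Minitab procedure; the function name `least_squares_from_sums` and its argument names are hypothetical.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Least-squares intercept and slope from summary sums, formulas (13.3)-(13.6)."""
    s_xx = sum_x2 - sum_x ** 2 / n        # formula (13.4)
    s_xy = sum_xy - sum_x * sum_y / n     # formula (13.5)
    b1 = s_xy / s_xx                      # formula (13.3)
    b0 = sum_y / n - b1 * (sum_x / n)     # formula (13.6): y-bar minus b1 times x-bar
    return b0, b1, s_xx, s_xy


# Sums reported in Example 13.3 for the 20 observations of Table 13.2.
b0, b1, s_xx, s_xy = least_squares_from_sums(
    n=20, sum_x=304, sum_y=372, sum_x2=6512, sum_xy=6600
)
print(f"S_xx = {s_xx:.1f}, S_xy = {s_xy:.1f}")
print(f"fitted line: y-hat = {b0:.1f} + {b1:.1f} x")
```

Working from the sums rather than the raw observations mirrors the hand calculation, which is why no data table is needed here.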


EXAMPLE 13.4 The Minitab procedure for computing the equation of the regression line is shown in Fig. 
13-4. The command Regress 'Daysoff' 1 'Years' requires that the column containing the dependent variable
values be named after the word Regress, followed by the number of independent variables, followed by the 
column containing the independent variable values. The regression equation is printed out as the first line of 
output in Fig. 13-4. The remaining output will be discussed in later sections. 


MTB > Regress 'Daysoff' 1 'Years';
SUBC> Constant.


Regression Analysis


The regression equation is
Daysoff = 11.0 + 0.500 Years


Predictor        Coef     St Dev        T        P
Constant      11.0000     0.5703    19.29    0.000
Years          0.5000     0.0316    15.82    0.000


S = 1.374     R-Sq = 93.3%     R-Sq(adj) = 92.9%

Analysis of Variance

Source        DF        SS        MS         F
Regression     1    472.80    472.80    250.31
Error         18     34.00      1.89
Total         19    506.80


Fig. 13-4


The amount of computation saved by this procedure is considerable. Before computer software was
available, these computations were done with a hand-held calculator.


ERROR SUM OF SQUARES 
The error sum of squares, denoted by SSE, is given by formula (13.7):

$$SSE = \sum (y_i - \hat{y}_i)^2 \tag{13.7}$$

The differences, $(y_i - \hat{y}_i)$, are called residuals. A residual measures the deviation of an observed
data point from the estimated regression line. If the estimated regression line fits the data points
perfectly, as is the case for the data in Table 13.1, then SSE = 0. The more the variability of the data
points away from the line, the larger the value for SSE. The computation of SSE is illustrated in
Example 13.5.


EXAMPLE 13.5 The data in Table 13.2 give 20 observations for the number of years of service, x, and the
number of annual paid days off, y. The equation of the estimated regression line was found to be $\hat{y} = b_0 + b_1 x =
11.0 + 0.5x$ in Example 13.3. The computation of the residuals and SSE is shown in Table 13.4. Note that the
sum of the residuals, $\sum (y_i - \hat{y}_i)$, is equal to zero. The error sum of squares, SSE, is equal to 34. The error sum
of squares is shown in Fig. 13-4 as part of the Minitab output. It is shown under the Analysis of Variance
portion of the output and is located at the intersection of the Error row and the SS column.


Table 13.4


A convenient formula for computing SSE that is equivalent to formula (13.7) is given in formula
(13.8). The computation of $S_{yy}$ is the same as that for $S_{xx}$ using the y values instead of the x values.

$$SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \tag{13.8}$$


EXAMPLE 13.6 To illustrate the computation of SSE in Example 13.5 using the computation formula, recall
that in Example 13.3, we found that $S_{xx} = 1891.2$ and $S_{xy} = 945.6$. The computation of $S_{yy}$ is now illustrated:

$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 7426 - \frac{372^2}{20} = 506.8 \quad \text{and} \quad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 506.8 - \frac{945.6^2}{1891.2} = 34.0$$


STANDARD DEVIATION OF ERRORS


The standard deviation of the error term in the model $y = \beta_0 + \beta_1 x + e$ is represented by $\sigma$ and the
variance of the error term is $\sigma^2$. The variance $\sigma^2$ is estimated by $S^2$, where $S^2$ is given by formula
(13.9):

$$S^2 = \frac{SSE}{n - 2} \tag{13.9}$$

The square root of $S^2$ is called the standard deviation of errors, and is given by formula (13.10):

$$S = \sqrt{\frac{SSE}{n - 2}} \tag{13.10}$$


EXAMPLE 13.7 The standard deviation of errors for the data given in Table 13.2 is found by recalling that
n = 20 and in Example 13.6, we found that SSE = 34. The standard deviation of errors is:

$$S = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{34}{20 - 2}} = 1.374$$


The value for S is shown in the Minitab output given in Fig. 13-4. 


TOTAL SUM OF SQUARES 


Suppose we were requested to estimate the mean number of annual paid days off given by large
companies, but that we did not know the years of service, x, for each sampled employee. Our best
estimate would be the mean of the 20 values for y given in Table 13.2. The mean of these 20 values is
$\bar{y} = 18.6$ days per year. The accuracy of the estimate is related to the variation of the individual y
values about the mean. The sum of squares about the mean is called the total sum of squares, and is
given by

$$SST = \sum (y_i - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} \tag{13.11}$$


EXAMPLE 13.8 By referring to Example 13.6, it is seen that the total sum of squares is equal to $S_{yy}$.
Therefore, from Example 13.6, we see that SST = 506.8. The total sum of squares is given in the Minitab output
shown in Fig. 13-4.


REGRESSION SUM OF SQUARES 


When the regression line is not used in estimating the mean annual paid days off, the total sum of 
squares measures the variation about the mean, 18.6, as discussed in the previous section. When the 
regression line is used, there is still some unexplained variation about the regression line. This 
unexplained variation is given by SSE. This implies that the regression line explains an amount of 
variation equal to SST — SSE. This explained variation is called the regression sum of squares. The 
regression sum of squares is given by formula (13.12):


SSR = SST - SSE (13.12) 


EXAMPLE 13.9 When the mean, $\bar{y} = 18.6$ days per year, is used to estimate the mean number of days off per
year, the variation of the y values about this mean is SST = 506.8. When the number of years of service is taken
into account, the estimated regression line was found to be $\hat{y} = 11.0 + 0.5x$. There is variation about this line
that is not explained by the estimated regression line. This unexplained variation is given by SSE = 34. This
implies that the regression line explained SST - SSE = 506.8 - 34 = 472.8 or 93.3% of the variation of the y
values about the mean. The regression sum of squares is SSR = 472.8.


The regression sum of squares is directly computable by the formula

$$SSR = \sum (\hat{y}_i - \bar{y})^2 \tag{13.13}$$


EXAMPLE 13.10 Table 13.5 illustrates the computation of the regression sum of squares using formula
(13.13). Note that SSR = 472.80. In Examples 13.5 and 13.8, we found that SSE = 34.0 and SST = 506.8. We
see that SST = SSR + SSE. These three sums of squares are shown under the SS column of the analysis of
variance portion of Fig. 13-4. 
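
The three sums of squares and the standard deviation of errors can be checked from the quantities $S_{xx}$, $S_{xy}$, and $S_{yy}$ found in Examples 13.3 and 13.6. The Python sketch below is an illustration only; the variable names are hypothetical.

```python
import math

n = 20
s_xx, s_xy, s_yy = 1891.2, 945.6, 506.8   # from Examples 13.3 and 13.6

sse = s_yy - s_xy ** 2 / s_xx    # formula (13.8): error sum of squares
s = math.sqrt(sse / (n - 2))     # formula (13.10): standard deviation of errors
sst = s_yy                       # formula (13.11): total sum of squares
ssr = sst - sse                  # formula (13.12): regression sum of squares

print(f"SSE = {sse:.1f}, SST = {sst:.1f}, SSR = {ssr:.1f}")
print(f"S = {s:.3f}")
```

The identity SST = SSR + SSE holds by construction here, since SSR is computed as the difference.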


Table 13.5 


sum = 472.80 


COEFFICIENT OF DETERMINATION 


The coefficient of determination is defined by

$$r^2 = \frac{SSR}{SST} \tag{13.14}$$


When the data points fall perfectly on a straight line as in Fig. 13-1, SSE = 0 and therefore SST = 
SSR. In this case, the coefficient of determination is equal to 1. When the estimated regression line 
explains none of the variation in y, SSR = 0, and the coefficient of determination is equal to 0. When 
the coefficient of determination is expressed as a percentage, it can be thought of as the percentage of 
the total sum of squares that can be explained using the estimated regression equation. 


EXAMPLE 13.11 For the data in Table 13.2 and discussed in the previous examples, the coefficient of
determination is $r^2 = \frac{SSR}{SST} = \frac{472.8}{506.8} = 0.933$, or 93.3%.


The coefficient of determination may also be determined by the use of formula (13.15):

$$r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} \tag{13.15}$$


EXAMPLE 13.12 For the data in Table 13.2, it was found in Example 13.3 that $S_{xx} = 1891.2$ and $S_{xy} = 945.6$.
Also in Example 13.6 we found that $S_{yy} = 506.8$. Therefore, the coefficient of determination is found as follows:

$$r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{945.6^2}{1891.2 \times 506.8} = 0.933 \text{ or } 93.3\%$$

The value for $r^2$ is also given in Fig. 13-4.


MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTION OF THE 
SLOPE OF THE ESTIMATED REGRESSION EQUATION 


The slope of the estimated regression line, $b_1$, is a statistic and has a sampling distribution. That
is, if several different surveys concerning the number of years of service, x, and the number of annual
paid days off, y, were conducted, and the estimated regression line found in each case, the values for
the slope and the y intercept would not all be equal. The estimated slope, $b_1$, has a sampling
distribution which is normally distributed. The mean of $b_1$ is $\beta_1$ and the standard deviation of $b_1$ is
given in the following:

$$\sigma_{b_1} = \frac{\sigma}{\sqrt{S_{xx}}} \tag{13.16}$$

Since $\sigma$ is unknown, the standard deviation of errors is substituted for $\sigma$ to obtain the standard error
for $b_1$. The standard error for $b_1$ is given in formula (13.17):

$$S_{b_1} = \frac{S}{\sqrt{S_{xx}}} \tag{13.17}$$


EXAMPLE 13.13 For the data given in Table 13.2, the value for $S_{xx}$ was found to equal 1891.2 in Example
13.3 and in Example 13.7, the value for S was found to equal 1.374. The standard error for $b_1$ is therefore equal
to $S_{b_1} = \frac{S}{\sqrt{S_{xx}}} = \frac{1.374}{\sqrt{1891.2}} = 0.0316$. This value is given in the computer output in Fig. 13-4 at the intersection
of the Years row and the St Dev column. The standard error for $b_1$ is needed to set confidence intervals on and
test hypotheses concerning the slope of the population regression line.


INFERENCES CONCERNING THE SLOPE OF THE POPULATION 
REGRESSION LINE 


A $(1 - \alpha) \times 100\%$ confidence interval for $\beta_1$ is obtained by using the distribution theory given in
the previous section. The confidence interval is given in formula (13.18), where $t_{\alpha/2}$ is obtained from
the Student t distribution using n - 2 degrees of freedom.

$$b_1 \pm t_{\alpha/2}\, S_{b_1} \tag{13.18}$$


EXAMPLE 13.14 Consider the continuing example concerning the number of annual paid days off as a
function of the number of years of service. Suppose we wish to determine a 95% confidence interval for $\beta_1$. The
values for $b_1$ and $S_{b_1}$ were obtained in Examples 13.3 and 13.13. They may also be obtained from the computer
printout in Fig. 13-4. The values are $b_1 = 0.5$ and $S_{b_1} = 0.0316$. The value for $t_{.025}$ is obtained from the Student t
table using df = 20 - 2 = 18. The value is $t_{.025} = 2.101$. The 95% margin of error is $\pm t_{.025} S_{b_1} = \pm 2.101 \times 0.0316
= \pm 0.066$. The confidence interval extends from 0.5 - 0.066 = 0.434 to 0.5 + 0.066 = 0.566.


The steps for testing a hypothesis about the value of the slope of the population regression line
are given in Table 13.6.


Table 13.6


Steps for Testing the Value of the Slope of the Population Regression Line


Step 1: The null hypothesis is $H_0: \beta_1 = \beta_{10}$ and the alternative hypothesis is either lower,
upper, or two-tailed.


Step 2: Use the Student t distribution table and the level of significance $\alpha$ to determine
the rejection regions. The degrees of freedom is n - 2.


Step 3: The test statistic is computed as $t^* = \dfrac{b_1 - \beta_{10}}{S_{b_1}}$.


Step 4: State your conclusions. The null hypothesis is rejected if the computed value of
the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.


EXAMPLE 13.15 One of the most often tested hypotheses concerning $\beta_1$ is that it equals zero. This is
equivalent to testing the hypothesis that x does not determine y and that there is no significant relationship
between x and y. Suppose we wish to test $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ at significance level $\alpha = .05$ using the data in
Table 13.2. Recall from Example 13.3 that $b_1 = 0.5$ and from Example 13.13 that $S_{b_1} = 0.0316$. The value for
$\beta_{10}$ in Table 13.6 is 0. The critical values are determined by finding the t value with .025 area in the right tail
and df = 18. The values are $\pm 2.101$. The computed value of the test statistic is found as follows:

$$t^* = \frac{b_1 - \beta_{10}}{S_{b_1}} = \frac{0.5 - 0}{0.0316} = 15.82$$

Since this value falls far beyond the value 2.101, the null hypothesis is rejected. Note that the value 15.82 is 
found in the printout in Fig. 13-4 as the t value corresponding to the predictor years and the corresponding p 
value is given as 0.000. Most researchers would use the information given in the printout to test this hypothesis. 
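
The confidence interval of Example 13.14 and the test of Example 13.15 can be sketched together in Python. This is an illustration only: the critical value 2.101 is copied from the Student t table value quoted in the text rather than computed, S is recomputed from SSE = 34 to avoid rounding error, and all variable names are hypothetical.

```python
import math

n, b1 = 20, 0.5
sse, s_xx = 34.0, 1891.2
t_crit = 2.101            # t table value for df = 18 and right-tail area .025 (from the text)

s = math.sqrt(sse / (n - 2))      # standard deviation of errors, formula (13.10)
se_b1 = s / math.sqrt(s_xx)       # standard error of the slope, formula (13.17)

# 95% confidence interval for the slope, formula (13.18)
margin = t_crit * se_b1
ci = (b1 - margin, b1 + margin)

# Test of H0: beta1 = 0 against a two-tailed alternative (Table 13.6, step 3)
beta_10 = 0.0
t_star = (b1 - beta_10) / se_b1
reject = abs(t_star) > t_crit

print(f"standard error of b1 = {se_b1:.4f}")
print(f"95% CI for the slope: {ci[0]:.3f} to {ci[1]:.3f}")
print(f"t* = {t_star:.2f}, reject H0: {reject}")
```

Because the interval does not contain 0, it leads to the same conclusion as the two-tailed test at the .05 level.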


ESTIMATION AND PREDICTION IN LINEAR REGRESSION 


Suppose we are interested in estimating the mean number of annual paid days off for all
employees of large companies who have 20 years of service at such companies. Recalling basic
property 3 in Table 13.3, we have $E(y) = \beta_0 + \beta_1 x$. The mean number of annual paid days off would
be equal to $\beta_0 + 20\beta_1$. Since we do not know $\beta_0$ and $\beta_1$, we would use $b_0$ and $b_1$ to estimate $\beta_0$ and
$\beta_1$. The point estimate for the mean would be $b_0 + 20b_1$. The standard error of this estimate is needed
to set a confidence interval on the mean or to test a hypothesis about this mean. A general
expression for a confidence interval when estimating the mean response will now be given.


A $100(1 - \alpha)\%$ confidence interval for the mean response $E(y) = \beta_0 + \beta_1 x_0$ is given by formula
(13.19):

$$b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \tag{13.19}$$


EXAMPLE 13.16 The data in Table 13.2 will be used to obtain a 95% confidence interval for the mean
number of annual paid days off for employees having 20 years of service. In Example 13.3, the estimated
regression equation was found to be $\hat{y} = b_0 + b_1 x = 11.0 + 0.5x$. The point estimate of the mean is $\hat{y} = 11.0 +
0.5(20) = 21$ days. The standard error of this point estimate is found by recalling from Example 13.3 that $\bar{x} =
15.2$, $S_{xx} = 1891.2$, and n = 20. In Example 13.7, we found that S = 1.374. Also, $x_0 = 20$. The standard error is
equal to

$$S \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} = 1.374 \sqrt{\frac{1}{20} + \frac{(20 - 15.2)^2}{1891.2}} = 0.343$$

For df = n - 2 = 18 degrees of freedom, the Student t value is $t_{.025} = 2.101$. The 95% margin of error is

$$t_{.025}\, S \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} = 2.101 \times 0.343 = 0.72$$

The 95% confidence interval for the mean number of annual days off for employees having 20 years of service
extends from 21 - 0.72 = 20.28 days to 21 + 0.72 = 21.72 days.


Suppose we wished to predict the number of annual days off for a single individual having 20 
years of service rather than the mean number for all employees having 20 years of service. The 
prediction is still determined by using the estimated regression line as in the case of estimating the 
mean. However, the standard error is larger because a single observation is more uncertain than the 
mean of the population distribution. 

A $100(1 - \alpha)\%$ prediction interval when predicting a single observation y at $x = x_0$ is given by
formula (13.20):

$$b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \tag{13.20}$$


EXAMPLE 13.17 A 95% prediction interval for the number of annual days off for an individual having 20
years of service is found as follows. The point estimate is the same as that found when estimating the mean in
Example 13.16, 21 days. The standard error associated with the prediction is found as follows:

$$S \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} = 1.374 \sqrt{1 + \frac{1}{20} + \frac{(20 - 15.2)^2}{1891.2}} = 1.42$$

The margin of error associated with this prediction is 2.101 x 1.42 = 2.98. The prediction interval extends from
21 - 2.98 = 18.02 days to 21 + 2.98 = 23.98 days.


EXAMPLE 13.18 In the Minitab output shown in Fig. 13-4, if the subcommand Predict 20; is added to the
commands, the following additional output is obtained.


   Fit    StDev Fit         95.0% CI              95.0% PI
21.000        0.343    (20.280, 21.720)     (18.026, 23.974)


The fit value is the value obtained by substituting 20 into the estimated regression equation. The StDev Fit
value is the standard error associated with estimating the mean. In addition, the 95% confidence interval and the
95% prediction interval are also given. 
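
The confidence interval for the mean response and the prediction interval for a single observation differ only by the extra 1 under the square root. The Python sketch below computes both at $x_0 = 20$ from the summary values of the continuing example. It is an illustration only: the function name `intervals_at` is hypothetical, the t value 2.101 is taken from the text, and S is recomputed from SSE = 34.

```python
import math

n, x_bar, s_xx = 20, 15.2, 1891.2
b0, b1 = 11.0, 0.5
s = math.sqrt(34.0 / (n - 2))   # standard deviation of errors from SSE = 34
t_crit = 2.101                  # t table value for df = 18, right-tail area .025 (from the text)


def intervals_at(x0):
    """Point estimate, standard errors, CI for the mean, and PI for one observation at x0."""
    fit = b0 + b1 * x0
    core = 1 / n + (x0 - x_bar) ** 2 / s_xx   # shared term under both square roots
    se_mean = s * math.sqrt(core)             # standard error in formula (13.19)
    se_pred = s * math.sqrt(1 + core)         # standard error in formula (13.20)
    ci = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, se_mean, se_pred, ci, pi


fit, se_mean, se_pred, ci, pi = intervals_at(20)
print(f"fit = {fit:.1f}, SE(mean) = {se_mean:.3f}, SE(prediction) = {se_pred:.2f}")
print(f"95% CI for the mean: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% PI for one observation: {pi[0]:.2f} to {pi[1]:.2f}")
```

Both intervals are narrowest at $x_0 = \bar{x}$ and widen as $x_0$ moves away from the mean of the x values.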


LINEAR CORRELATION COEFFICIENT 


The linear correlation coefficient, also known as simply the correlation coefficient, is a measure
of the strength of the linear association between two variables. The sample correlation coefficient is
given by formula (13.21). The correlation coefficient is also sometimes referred to as the Pearson
correlation coefficient.

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \tag{13.21}$$


EXAMPLE 13.19 The correlation coefficient for the data in Table 13.2 will be determined. In Example 13.3
we found that $S_{xy} = 945.6$ and $S_{xx} = 1891.2$. In Example 13.6, we found that $S_{yy} = 506.8$. The correlation
coefficient is therefore found as follows.

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{945.6}{\sqrt{1891.2 \times 506.8}} = 0.966$$

The Minitab computation for r is as shown below. The x values are put into column 1 and the y values are put
into column 2 and the command corr c1 c2 is used to obtain the correlation.


MTB > corr c1 c2
Correlations (Pearson)


Correlation of Years and Daysoff = 0.966


The basic properties of the sample correlation coefficient are given in Table 13.7. 


Table 13.7 


Basic Properties of r 


1. The value of r is always between -1 and +1, i.e., $-1 \le r \le +1$.


2. The magnitude of r indicates the strength of the linear relationship, and the sign of r
indicates whether the relation is direct or inverse. If r is positive, then y tends to
increase linearly as x increases, with the tendency being greater the closer that r is to
1. If r is negative, then y tends to decrease linearly as x increases, with the tendency
being greater the closer that r is to -1. If all the points on the scatter diagram lie
perfectly on a straight line with a positive slope, then r = +1. If all the points on the
scatter diagram lie perfectly on a straight line with a negative slope, then r = -1.


3. A value of r close to —1 or +1 represents a strong linear relationship. 


4. A value of r close to 0 represents a weak linear relationship. 


If the formula for r is applied to every pair of (x, y) values in the population, we obtain the population
correlation coefficient. The Greek letter $\rho$ represents the population correlation coefficient.


EXAMPLE 13.20 The data given in Table 13.1 and the corresponding scatter plot shown in Fig. 13-1 are
reproduced below.

For the data shown in the table, we have:

$n = 8$, $\sum x = 114$, $\sum y = 145$, $\sum x^2 = 2316$, $\sum y^2 = 2801$, and $\sum xy = 2412$.

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 2316 - 1624.5 = 691.5, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 2801 - 2628.125 = 172.875,$$

$$\text{and} \quad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2412 - 2066.25 = 345.75.$$

The correlation coefficient is then found as follows:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{345.75}{\sqrt{691.5 \times 172.875}} = 1$$

Since the points fall exactly on a straight line with a positive slope, the correlation coefficient is equal to 1.


INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT


One of the most important uses of the sample correlation coefficient is the determination of
whether or not a correlation exists between two population variables. Recall that r measures the
correlation between x and y in the sample and $\rho$ is the corresponding population correlation. In
particular, we are often interested in testing the null hypothesis that $\rho = 0$ vs. one of three alternatives.
The alternative $\rho \neq 0$ states that a correlation is present. The alternative $\rho < 0$ states that an inverse
correlation exists, and the alternative $\rho > 0$ states that a direct correlation exists. Table 13.8 gives the
steps for testing a hypothesis concerning $\rho$.


Table 13.8


Steps for Testing for No Population Correlation


Step 1: The null hypothesis is $H_0: \rho = 0$ and the alternative hypothesis is either
lower, upper, or two-tailed.


Step 2: Use the Student t distribution table and the level of significance $\alpha$ to
determine the rejection regions. The degrees of freedom is n - 2.


Step 3: The test statistic is computed as $t^* = r \sqrt{\dfrac{n - 2}{1 - r^2}}$.


Step 4: State your conclusions. The null hypothesis is rejected if the computed
value of the test statistic falls in the rejection region. Otherwise, the null
hypothesis is not rejected.




EXAMPLE 13.21 The data in Table 13.2 and the computed value of r in Example 13.19 will be used to test
whether or not there is a positive correlation between the years of service and the number of annual paid days
off for employees of large companies. The null hypothesis is $H_0: \rho = 0$ and the alternative hypothesis is $H_a: \rho >
0$. The computed value of the test statistic is:

$$t^* = r \sqrt{\frac{n - 2}{1 - r^2}} = 0.966 \sqrt{\frac{20 - 2}{1 - (0.966)^2}} = 15.85$$


Because of this large value of t*, we would reject the null and conclude that a positive correlation exists in the 
population. 
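
The correlation coefficient and its test statistic can be computed from the same summary sums used for the regression line. The Python sketch below is an illustration with hypothetical function names. It also shows that carrying r to full precision gives t* of about 15.82, the same value as the t statistic for the slope in Fig. 13-4; the value 15.85 in Example 13.21 results from rounding r to 0.966 before substituting.

```python
import math


def correlation_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Sample correlation coefficient, formula (13.21), from summary sums."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


def t_statistic(r, n):
    """Test statistic for H0: rho = 0 (Table 13.8, step 3); undefined when r is exactly +1 or -1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Table 13.2 (sums from Examples 13.3 and 13.6)
r_table_13_2 = correlation_from_sums(20, 304, 372, 6512, 7426, 6600)
print(f"r = {r_table_13_2:.3f}")
print(f"t* with unrounded r = {t_statistic(r_table_13_2, 20):.2f}")
print(f"t* with r rounded to 0.966 = {t_statistic(0.966, 20):.2f}")

# Table 13.1 (sums from Example 13.20): the points lie exactly on a line
r_table_13_1 = correlation_from_sums(8, 114, 145, 2316, 2801, 2412)
print(f"r for Table 13.1 = {r_table_13_1:.3f}")
```

In simple linear regression the test of $\rho = 0$ and the test of $\beta_1 = 0$ use the same t statistic, which is why the unrounded value matches the Minitab output for the slope.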


Solved Problems 


STRAIGHT LINES 
13.1 Find the slope and y intercept of the following straight lines.
(a) y = 1.5x - 2.5 (b) y = 3.0 - 2.5x (c) 2y = 4x - 6 (d) 16x - 32y + 8 = 0


Ans. In order to find the slope and y intercept, each equation will be put into the form y = mx + b. The
number m is the slope and the number b is the y intercept. (a) The slope is m = 1.5 and the y
intercept is b = -2.5. (b) The slope is m = -2.5 and the y intercept is b = 3.0. (c) The given line is
equivalent to y = 2x - 3 and the slope is m = 2 and the y intercept is b = -3. (d) The given line is
equivalent to y = .5x + .25 and the slope is m = .5 and the y intercept is b = .25.


13.2 Determine which of the following points fall on the line y = -2x + 4. 
(a) (2,0) (b) (0,2) (c) (50,96) (d) (-10,24) (e) (14.23, -24.46) 


Ans. The points given in (a), (d), and (e) fall on the line since they satisfy the equation. 
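
The check in problem 13.2 amounts to substituting each x into the equation and comparing the result with the given y. A minimal Python sketch follows; the function name `on_line` and the tolerance are hypothetical choices, with the tolerance guarding against floating-point rounding in part (e).

```python
import math


def on_line(point, slope=-2.0, intercept=4.0, tol=1e-9):
    """Return True if the point (x, y) satisfies y = slope * x + intercept."""
    x, y = point
    return math.isclose(y, slope * x + intercept, abs_tol=tol)


points = {
    "a": (2, 0),
    "b": (0, 2),
    "c": (50, 96),
    "d": (-10, 24),
    "e": (14.23, -24.46),
}
points_on_line = [label for label, point in points.items() if on_line(point)]
print(points_on_line)
```

Point (b) fails because the line gives y = 4 at x = 0, and point (c) fails because the line gives y = -96 at x = 50.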


LINEAR REGRESSION MODEL 


13.3 Identify the values of \(\beta_0\), \(\beta_1\), and \(\sigma\) in the linear regression model y = 12.1 + 3.5x + e, where e 
is a normal random variable with mean 0 and standard deviation 1.2. 


Ans. \(\beta_0 = 12.1\), \(\beta_1 = 3.5\), and \(\sigma = 1.2\). 


13.4 Consider the linear regression model y = -1.5x + 5.6 + e, where e is a normal random variable 
with mean 0 and variance \(\sigma^2 = 4\). Determine the mean and standard deviation of y when x = 2 
and when x = 4.5. 


Ans. When x = 2, y = -1.5(2) + 5.6 + e = 2.6 + e, so E(y) = E(2.6 + e) = 2.6. Since 2.6 is a constant, the 
variance of y is the same as the variance of e, or 4. Therefore the standard deviation of y is 2 when 
x = 2. 
When x = 4.5, y = -1.5(4.5) + 5.6 + e = -1.15 + e, so E(y) = E(-1.15 + e) = -1.15. Since -1.15 is a 
constant, the variance of y is the same as the variance of e, or 4. Therefore the standard deviation 
of y is 2 when x = 4.5. Notice that the standard deviation of y is 2, regardless of the value of x. 


LEAST SQUARES LINE 


13.5 The number of hours spent per week viewing TV, y, and the number of years of education, x, 
were recorded for 10 randomly selected individuals. The results are given in Table 13.9. In 
addition, this table gives computations needed in finding \(b_0\) and \(b_1\). Find the least-squares line 
for these data. 


Ans. \(S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 2085 - 1988.1 = 96.9\) and \(S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 1351 - 1494.6 = -143.6\). 
\(\bar{x} = 14.1\), \(\bar{y} = 10.6\), \(b_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{-143.6}{96.9} = -1.4819\), and \(b_0 = \bar{y} - b_1\bar{x} = 10.6 - (-1.4819)(14.1)\), so 
\(b_0 = 10.6 + 20.8948 = 31.4948\). 
The equation of the least-squares line is \(\hat{y} = b_0 + b_1 x = 31.495 - 1.482x\). 
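
The computation in problem 13.5 needs only the column totals, so it can be sketched in Python from the sums alone. The totals used below are those quoted in the solution; the values 141 and 106 for the sums of x and y follow from the means 14.1 and 10.6 with n = 10. The function name `least_squares_from_sums` is hypothetical.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (S_xx, S_xy, b1, b0) computed from the column totals."""
    s_xx = sum_xx - sum_x ** 2 / n        # S_xx = sum(x^2) - (sum x)^2 / n
    s_xy = sum_xy - sum_x * sum_y / n     # S_xy = sum(xy) - (sum x)(sum y) / n
    b1 = s_xy / s_xx                      # slope
    b0 = sum_y / n - b1 * sum_x / n       # intercept = y-bar - b1 * x-bar
    return s_xx, s_xy, b1, b0

# Totals for the TV viewing data: n = 10, sum x = 141, sum y = 106,
# sum x^2 = 2085, sum xy = 1351.
s_xx, s_xy, b1, b0 = least_squares_from_sums(10, 141, 106, 2085, 1351)
print(round(s_xx, 1), round(s_xy, 1), round(b1, 4), round(b0, 3))
```

Because the slope is not rounded before the intercept is computed, the intercept can differ from the hand calculation in the fourth decimal place.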


Table 13.9 


13.6 The Minitab output for the data in problem 13.5 is shown in Fig. 13-5. Give the estimated 
regression line. 


MTB > Regress 'TVhours' 1 'Yearsedu'; 
SUBC> Constant; 
SUBC> Predict 15. 


Regression Analysis 
The regression equation is 
TVhours = 31.5 - 1.48 Yearsedu 


Predictor Coef StDev T P 
Constant 31.495 4.388 7.18 0.000 
Yearsedu -1.4819 0.3039 -4.88 0.000 


S = 2.992    R-Sq = 74.8%    R-Sq(adj) = 71.7% 


Analysis of Variance 

Source        DF       SS       MS       F       P 
Regression     1   212.81   212.81   23.78   0.000 
Error          8    71.59     8.95 
Total          9   284.40 


Fit     St Dev Fit    95.0% CI           95.0% PI 
9.266        0.985    (6.995, 11.538)    (2.002, 16.531) 


Fig. 13-5 


Ans. From Fig. 13-5, we see that the estimated regression equation is given in the following portion of 
the output. Note that this is the same equation as the least-squares line found in problem 13.5. 


The regression equation is TVhours = 31.5 - 1.48 Yearsedu 

Predictor      Coef    StDev       T       P 
Constant     31.495    4.388    7.18   0.000 
Yearsedu    -1.4819   0.3039   -4.88   0.000 


ERROR SUM OF SQUARES 


13.7 Compute the error sum of squares for the data in Table 13.9 using both the defining formula 
and the computational formula. 


Ans. Table 13.10 gives the details of the computation of \(SSE = \sum (y_i - \hat{y}_i)^2\). The columns from left to 
right give the following information: individual or subject, number of years of education, number 
of hours spent per week watching TV, the fitted or estimated value using the least-squares line, 
the residual, and the square of the residual. 


Table 13.10 

Individual    x    y    Fitted    Residual   Residual squared 
 1           12   10   13.7121   -3.71207    13.7795 
 2           14    9   10.7482   -1.74819     3.0562 
 3           11   15   15.1940   -0.19401     0.0376 
 4           16    8    7.7843    0.21569     0.0465 
 5           16    5    7.7843   -2.78431     7.7524 
 6           18    4    4.8204   -0.82043     0.6731 
 7           12   20   13.7121    6.28793    39.5380 
 8           20    4    1.8566    2.14345     4.5944 
 9           10   16   16.6760   -0.67595     0.4569 
10           12   15   13.7121    1.28793     1.6588 

\(\sum (y_i - \hat{y}_i) = 0\) and \(\sum (y_i - \hat{y}_i)^2 = 71.593\) 


The error sum of squares may also be computed using \(SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}\). From problem 13.5, we have 
\(S_{xx} = 96.9\) and \(S_{xy} = -143.6\). In addition, \(S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 1408 - 1123.6 = 284.4\). Using these values, 
we find \(SSE = 284.4 - \dfrac{(-143.6)^2}{96.9} = 71.593\). 


13.8 Using the Minitab output shown in Fig. 13-5, locate SSE and compare it with the result found 
in problem 13.7. 


Ans. The error sum of squares, 71.59, appears in the SS column of the Error row in the following portion of 
Fig. 13-5. It agrees with the value found in problem 13.7. 


Analysis of Variance 


Source        DF       SS       MS       F       P 
Regression     1   212.81   212.81   23.78   0.000 
Error          8    71.59     8.95 
Total          9   284.40 


STANDARD DEVIATION OF ERRORS 


13.9 Use the computed value for SSE in problems 13.7 and 13.8 to find the standard deviation of 
errors, S. 


Ans. The standard deviation of errors is given by the formula \(S = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{71.593}{8}} = 2.9915\). 


13.10 Locate the standard deviation of errors in Fig. 13-5 and confirm that it is the same as that 
computed in problem 13.9. 


Ans. The following row taken from Fig. 13-5 gives the value for S. 


S = 2.992    R-Sq = 74.8%    R-Sq(adj) = 71.7% 


TOTAL SUM OF SQUARES 


13.11 Find the total sum of squares for the data given in Table 13.9. 


Ans. The total sum of squares is given by \(SST = \sum (y_i - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n} = 1408 - \dfrac{106^2}{10} = 1408 - 1123.6 = 284.4\). 
The values for \(\sum y^2\) and \(\sum y\) are found at the bottom of Table 13.9. 


13.12 Locate the Total sum of squares in Fig. 13-5 and confirm that it is the same as that found in 
problem 13.11. 


Ans. The following portion of Fig. 13-5 gives the total sum of squares. It is the same as that found in 
problem 13.11. 


Source DF SS MS F P 
Regression 1 212.81 212.81 23.78 0.000 
Error 8 71.59 8.95 

Total 9 284.40 


REGRESSION SUM OF SQUARES 


13.13 Find the regression sum of squares for the data given in Table 13.9 by subtraction as well as 
by direct computation. 


Ans. The regression sum of squares is given by SSR = SST — SSE = 284.4 — 71.593 = 212.807, where 
SST is computed in problems 13.11 and 13.12 and SSE is computed in problems 13.7 and 13.8. 
The direct computation is illustrated in Table 13.11. The mean of the y values is 10.6. 


Table 13.11 

Individual    Fitted    Fitted - mean   (Fitted - mean) squared 
 1           13.7121     3.11207          9.6850 
 2           10.7482     0.14819          0.0220 
 3           15.1940     4.59401         21.1049 
 4            7.7843    -2.81569          7.9281 
 5            7.7843    -2.81569          7.9281 
 6            4.8204    -5.77957         33.4034 
 7           13.7121     3.11207          9.6850 
 8            1.8566    -8.74345         76.4479 
 9           16.6760     6.07595         36.9172 
10           13.7121     3.11207          9.6850 

Sum of the squared deviations = 212.81 


13.14 Locate the regression sum of squares in Fig. 13-5 and confirm that you get the same value as 
that computed in problem 13.13. 


Ans. The regression sum of squares is shown in the following portion of output selected from Fig. 
13-5. 

Source        DF       SS       MS       F       P 
Regression     1   212.81   212.81   23.78   0.000 
Error          8    71.59     8.95 
Total          9   284.40 


COEFFICIENT OF DETERMINATION 


13.15 Calculate the coefficient of determination for the results given in Table 13.9 and explain its 
meaning. 

Ans. The coefficient of determination is given by \(r^2 = \dfrac{SSR}{SST} = \dfrac{212.81}{284.4} = 0.748\). The estimated 
regression equation accounts for about 75% of the variation in TV viewing time. 


13.16 Locate the coefficient of determination in Fig. 13-5 and confirm that you get the same value as 
that computed in problem 13.15. 


Ans. The \(r^2\) value shown in the following row, taken from Fig. 13-5, is the same as that in problem 
13.15. 


S = 2.992 R-Sq = 74.8% R-Sq(adj) = 71.7% 
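
The sums of squares in problems 13.7 through 13.16 are all functions of the three quantities S_xx, S_xy, and S_yy, which makes them easy to verify together. The Python sketch below uses the values quoted in problems 13.5 and 13.7; the variable names are illustrative only.

```python
import math

n = 10
s_xx, s_xy, s_yy = 96.9, -143.6, 284.4   # from problems 13.5 and 13.7

sst = s_yy                     # total sum of squares
ssr = s_xy ** 2 / s_xx         # regression sum of squares
sse = sst - ssr                # error sum of squares (computational formula)
r_squared = ssr / sst          # coefficient of determination
s = math.sqrt(sse / (n - 2))   # standard deviation of errors

print(round(sst, 1), round(ssr, 2), round(sse, 2), round(r_squared, 3), round(s, 3))
```

Writing SSR as S_xy squared over S_xx shows directly why SST = SSR + SSE holds for the least-squares line.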


MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTION OF THE 
SLOPE OF THE ESTIMATED REGRESSION EQUATION 


13.17 Find the standard error for \(b_1\) using the data in Table 13.9. 


Ans. The standard error for \(b_1\) is given by \(S_{b_1} = \dfrac{S}{\sqrt{S_{xx}}} = \dfrac{2.9915}{\sqrt{96.9}} = 0.304\). The value for S is found in 
problem 13.9 and the value for \(S_{xx}\) is given in problem 13.5. 


13.18 Locate the value for the standard error for \(b_1\) in Fig. 13-5 and compare it with the value 
computed in problem 13.17. 


Ans. The standard error of \(b_1\), 0.3039, appears in the St Dev column of the Yearsedu row in the following 
selection taken from Fig. 13-5. It is the same as that computed in problem 13.17. 


Predictor      Coef   St Dev       T       P 
Constant     31.495    4.388    7.18   0.000 
Yearsedu    -1.4819   0.3039   -4.88   0.000 


INFERENCES CONCERNING THE SLOPE OF THE POPULATION 
REGRESSION LINE 


13.19 Use the data in Table 13.9 to test \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 < 0\) at significance level \(\alpha = .05\). 


Ans. First, let us determine the critical value. The degrees of freedom is df = n - 2 = 8. The t value for 
a right tail area equal to .05 is \(t_{.05} = 1.860\). Since this is a lower-tail test, the critical value is 
-1.860. The computed value of the test statistic is \(t^* = \dfrac{b_1 - \beta_{10}}{S_{b_1}}\). From problem 13.5, we know 
that \(b_1 = -1.4819\), and from problem 13.17, \(S_{b_1} = 0.304\). The hypothesized value for the slope is 
\(\beta_{10} = 0\). The computed value of the test statistic is therefore \(t^* = \dfrac{-1.4819 - 0}{0.304} = -4.87\). Since this 
value is smaller than -1.860, we reject the null hypothesis and conclude that the slope is less than 
zero. This indicates that the number of hours spent viewing TV and the number of years of 
education are inversely related. 
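
The test in problem 13.19 can be reproduced numerically as a check. The sketch below is in Python with illustrative names; the critical value is taken from the t table as in the solution, not computed.

```python
import math

def slope_t_statistic(b1, s, s_xx, beta_10=0.0):
    """Return (t*, S_b1), where S_b1 = S / sqrt(S_xx) and t* = (b1 - beta_10) / S_b1."""
    s_b1 = s / math.sqrt(s_xx)
    return (b1 - beta_10) / s_b1, s_b1

# b1 from problem 13.5, S from problem 13.9, S_xx from problem 13.5.
t_star, s_b1 = slope_t_statistic(b1=-1.4819, s=2.9915, s_xx=96.9)
critical_value = -1.860            # -t_.05 with 8 degrees of freedom, from the t table
reject_null = t_star < critical_value
print(round(s_b1, 3), round(t_star, 2), reject_null)
```

Carrying more decimal places for the standard error gives a statistic close to -4.88, the value Minitab reports, rather than the -4.87 obtained from the rounded 0.304.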


13.20 Locate the computed value of the test statistic in problem 13.19 in the Minitab output given in 
Fig. 13-5. 


Ans. The following selection from Fig. 13-5 gives the computed test statistic as —4.88 with a 
corresponding p value equal to 0.000. 


Predictor Coef StDev T P 
Constant 31.495 4.388 7.18 0.000 
Yearsedu -1.4819 0.3039 -4.88 0.000 


ESTIMATION AND PREDICTION IN LINEAR REGRESSION 


13.21 Find a 95% confidence interval for the mean number of hours spent per week watching TV 
for all individuals with 15 years of education and a 95% prediction interval for an individual 
with 15 years of education using the data in Table 13.9. 


Ans. The 95% confidence interval for the mean is given by \(b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}\). The 
point estimate for the mean is \(\hat{y} = b_0 + b_1(15) = 31.495 - 1.482(15) = 9.265\). The margin of 
error associated with this estimate is \(t_{\alpha/2}\, S \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}\). The value for \(t_{.025}\) is found by using 
8 degrees of freedom to be 2.306. The following values have been found in previous problems: 
S = 2.992, n = 10, \(\bar{x} = 14.1\), \(S_{xx} = 96.9\), and \(x_0 = 15\). Therefore, the margin of error is: 

\[ 2.306 \times 2.992 \times \sqrt{\frac{1}{10} + \frac{(15 - 14.1)^2}{96.9}} = 2.27 \]

The 95% confidence interval for the mean number of 
hours watching TV for all individuals having 15 years of education extends from 9.265 - 2.27 = 
6.995 to 9.265 + 2.27 = 11.535. The 95% prediction interval for an individual is given by 

\[ b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \]

The point estimate is the same as given above when setting a confidence interval on the mean 
and the t value is the same as above. The margin of error is computed as follows: 

\[ 2.306 \times 2.992 \times \sqrt{1 + \frac{1}{10} + \frac{(15 - 14.1)^2}{96.9}} = 7.264 \]

The 95% prediction interval extends from 9.265 
- 7.264 = 2.001 to 9.265 + 7.264 = 16.529. Note that the prediction interval is much wider than 
the confidence interval for the mean. 
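
The two margins of error in problem 13.21 differ only by the extra 1 under the square root, which a short Python sketch makes explicit. The function name `interval_margins` is hypothetical, and the t value 2.306 is read from the table as in the solution.

```python
import math

def interval_margins(t_value, s, n, x0, x_bar, s_xx):
    """Return (confidence-interval margin, prediction-interval margin) at x0."""
    core = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_margin = t_value * s * math.sqrt(core)        # for the mean of y at x0
    pi_margin = t_value * s * math.sqrt(1 + core)    # for a single new y at x0
    return ci_margin, pi_margin

fit = 31.495 - 1.482 * 15                            # point estimate at x0 = 15
ci_margin, pi_margin = interval_margins(2.306, 2.992, 10, 15, 14.1, 96.9)
print(round(fit, 3), round(ci_margin, 2), round(pi_margin, 3))
print((round(fit - ci_margin, 3), round(fit + ci_margin, 3)))
print((round(fit - pi_margin, 3), round(fit + pi_margin, 3)))
```

The extra 1 accounts for the variability of an individual observation about the regression line, which is why the prediction interval is always the wider of the two.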


13.22 Locate the 95% confidence interval and the 95% prediction interval in Fig. 13-5 and compare 
with the results in problem 13.21. 


Ans. The following portion of the output given in Fig. 13-5 gives the 95% confidence interval and prediction 
interval. The results are the same except for a small amount of round-off error. 


Fit St Dev Fit 95.0% CI 95.0% PI 
9.266 0.985 (6.995, 11.538) (2.002, 16.531) 


LINEAR CORRELATION COEFFICIENT 
13.23 Compute the linear correlation coefficient for the data in Table 13.9. 


Ans. The correlation coefficient is given by the formula \(r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\). In problem 13.5, we found 
that \(S_{xx} = 96.9\) and \(S_{xy} = -143.6\). In problem 13.7, we found that \(S_{yy} = 284.4\). The correlation 
coefficient is equal to \(\dfrac{-143.6}{\sqrt{96.9 \times 284.4}} = -.865\). 


13.24 Determine the correlation coefficient using the computer output in Fig. 13-5. 


Ans. In problem 13.16, we found that \(r^2 = .748\). Therefore \(r = \pm\sqrt{.748} = \pm .865\). The negative sign is 
taken since we know that the variables are inversely related. We know they are inversely related 
since the slope of the estimated regression line is negative. Therefore r = -.865. 


INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT 


13.25 Use the data in Table 13.9 to test \(H_0: \rho = 0\) vs. \(H_a: \rho < 0\) at \(\alpha = .01\). 


Ans. The critical value is determined by noting that df = n - 2 = 10 - 2 = 8 and that \(t_{.01} = 2.896\). Since 
this is a lower-tail test, the critical value is -2.896. The test statistic is computed as 
\(t^* = r\sqrt{\dfrac{n-2}{1-r^2}} = -.865\sqrt{\dfrac{8}{1 - .748225}} = -4.876\). Since this value is less than the critical value, we 
conclude that the two variables are negatively correlated. 


13.26 A sample of 100 federal taxpayers were interviewed and the annual income and cost to 
prepare their return was determined for each. The correlation coefficient between the two 
variables was found to be 0.37. Do these results indicate a positive correlation between these 
two variables in the population? Use a = .05. 


Ans. Because of the large sample size, we may use the standard normal distribution table to find the 
critical value. The critical value is 1.645. The computed value of the test statistic is 
\(t^* = r\sqrt{\dfrac{n-2}{1-r^2}} = .37\sqrt{\dfrac{98}{.8631}} = 3.94\). Since the computed value of the test statistic exceeds the critical 
value, we conclude that there is a positive population correlation. 
-r 


Supplementary Problems 


STRAIGHT LINES 


13.27 Figure 13-6 shows the line whose equation is y = 4 + 2x and 8 points, labeled A through H. Points A, C, 
F, and G have the following coordinates: A:(2, 6), C:(2, 10), G:(5, 16), and F:(5, 12). Find the 
coordinates for the points B, D, E, and H. Also find the distance between the following pairs of points: 
A and B, B and C, H and G, and H and F. Find the sum of the squares of the distances between the four 
pairs of points. 


Fig. 13-6 


Ans. B: (2, 8), D: (3, 10), E: (4, 12), and H: (5, 14) 
All four of the distances are equal to 2 units. The sum of the squares of the distances is equal to 
16. 


13.28 Use the coordinates of the points B and H found in problem 13.27 to find the slope of the straight line 
passing through these points and compare this with the slope of the line y = 4 + 2x. 


Ans. The slope of the line is \(m = \dfrac{y_2 - y_1}{x_2 - x_1} = \dfrac{14 - 8}{5 - 2} = 2\). The slope of the line y = 4 + 2x is also m = 2. 


LINEAR REGRESSION MODEL 


13.29 A linear regression model has a slope of 12.5 and a y intercept equal to 19.2. The error term has a 
standard deviation equal to 2.5. Give the equation of the linear regression model. 


Ans. y=19.2+ 12.5x+e 


13.30 For the linear regression model described in problem 13.29, find the mean values for y when x equals 2 
and 5. 


Ans. 44.2 and 81.7 


LEAST-SQUARES LINE 


13.31 Table 13.12 gives the household income in thousands of dollars, x, and cost of filing federal taxes in 
dollars, y, for 20 randomly selected federal taxpayers. Find the following: \(\sum x\), \(\sum x^2\), \(\sum y\), \(\sum y^2\), \(\sum xy\), \(S_{xx}\), 
\(S_{yy}\), \(S_{xy}\), \(b_1\), and \(b_0\). 


Table 13.12 


Ans. \(\sum x = 733.10\), \(\sum x^2 = 31,992\), \(\sum y = 1557.50\), \(\sum y^2 = 153,407\), \(\sum xy = 68,864\), \(S_{xx} = 5120.2195\), 
\(S_{yy} = 32116.6875\), \(S_{xy} = 11773.8375\), \(b_1 = 2.2995\), \(b_0 = -6.4132\). 


13.32 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the values for \(b_0\) and \(b_1\). 


MTB > Regress 'Cost' 1 'Income'; 
SUBC> Predict 40. 


Regression Analysis 


The regression equation is 
Cost = - 6.42 + 2.30 Income 


Predictor      Coef   St Dev       T       P 
Constant     -6.420    9.353   -0.69   0.501 
Income       2.2997   0.2339    9.83   0.000 


S = 16.73    R-Sq = 84.3%    R-Sq(adj) = 83.4% 


Analysis of Variance 


Source        DF      SS      MS       F       P 
Regression     1   27076   27076   96.70   0.000 
Error         18    5040     280 
Total         19   32116 


Fit     St Dev Fit    95.0% CI          95.0% PI 
85.57         3.82    (77.53, 93.60)    (49.50, 121.64) 


Fig. 13-7 


Ans. The estimated regression equation, Cost = - 6.42 + 2.30 Income, gives the values of \(b_0\) and \(b_1\). The 
differences between these values and those given in problem 13.31 are due to round-off error. 


ERROR SUM OF SQUARES 
13.33 Use the results given in problem 13.31 to find SSE for the regression line fit to the data in Table 13.12. 


Ans. \(SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}} = 32116.6875 - 27073.6927 = 5042.9948\) 


13.34 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value for SSE. 


Ans. Error 18 5040 280 
The difference in the values for SSE in problems 13.33 and 13.34 is due to round-off error. 


STANDARD DEVIATION OF ERRORS 


13.35 Use the result found in problem 13.33 to find the standard deviation of errors for the regression line fit 
to the data in Table 13.12. 


Ans. \(S = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{5042.9948}{18}} = 16.7382\) 


13.36 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value for the standard deviation of errors. 


Ans. S = 16.73 R-Sq = 84.3% R-Sq(adj) = 83.4% 


TOTAL SUM OF SQUARES 
13.37 Use the results given in problem 13.31 to find the total sum of squares for the data given in Table 13.12. 
Ans. \(SST = S_{yy} = 32116.6875\) 


13.38 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value for the total sum of squares. 


Ans. Total 19 32116 


REGRESSION SUM OF SQUARES 


13.39 Use the results of problems 13.33 and 13.37 to find the regression sum of squares for the data in Table 
13.12. 


Ans. SSR = SST - SSE = 32116.6875 — 5042.9948 = 27073.6927 


13.40 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value for the regression sum of squares. 


Ans. Regression 1 27076 27076 96.70 0.000 
The difference in the answers given in problems 13.39 and 13.40 is due to round-off error. 


COEFFICIENT OF DETERMINATION 
13.41 Use the total sum of squares found in problem 13.37 and the regression sum of squares found in 
problem 13.39 to find the coefficient of determination for the linear regression model applied to the data 
in Table 13.12. 


Ans. \(r^2 = \dfrac{SSR}{SST} = \dfrac{27073.6927}{32116.6875} = 0.843\) 


13.42 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value of the coefficient of determination. 


Ans. S = 16.73 R-Sq = 84.3% R-Sq(adj) = 83.4% 


MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTION OF THE SLOPE OF THE 
ESTIMATED REGRESSION EQUATION 


13.43 Find the standard error for \(b_1\), the slope of the estimated regression equation found in problem 13.31. 


Ans. \(S_{b_1} = \dfrac{S}{\sqrt{S_{xx}}} = \dfrac{16.7382}{\sqrt{5120.2195}} = 0.2339\) 


13.44 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give 
the portion of the output that shows the value of the standard error for \(b_1\). 


Ans. Income 2.2997 0.2339 9.83 0.000 


INFERENCES CONCERNING THE SLOPE OF THE POPULATION REGRESSION LINE 


13.45 Using the data in Table 13.12, test the hypothesis that the slope of the line in the population regression 
model is equal to 0. Use level of significance \(\alpha = .05\). 


Ans. The critical values are ±2.101. The computed value of the test statistic is \(t^* = \dfrac{b_1 - \beta_{10}}{S_{b_1}} = \dfrac{2.2995 - 0}{.2339} = 9.8311\). 
The slope of the regression line is different from 0. 


13.46 Give the portion of Fig. 13-7 that is used to test the null hypothesis that the slope of the regression line 
is equal to 0. The regression model is \(y = \beta_0 + \beta_1 x + e\), where x = household income in thousands, and 
y = cost of filing federal taxes. Assume \(\alpha = .01\). 


Ans. Predictor      Coef   St Dev       T       P 
     Constant     -6.420    9.353   -0.69   0.501 
     Income       2.2997   0.2339    9.83   0.000 


The computed test statistic is seen to be the same as in problem 13.45. The p value, 0.000, 
indicates that the null hypothesis should be rejected. 


ESTIMATION AND PREDICTION IN LINEAR REGRESSION 


13.47 Find a 95% confidence interval for the mean cost of filing federal income taxes for all those individuals 
with household income equal to $40,000 and a 95% prediction interval for an individual with household 
income equal to $40,000 using the data in Table 13.12. 


Ans. The 95% confidence interval for the mean is given by \(b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}\). The 
point estimate of the mean is -6.4132 + 2.2995 × 40 = 85.5668. The margin of error is 

\[ \pm\, 2.101 \times 16.7382 \times \sqrt{\frac{1}{20} + \frac{(40 - 36.655)^2}{5120.2195}} \]

or ± 8.0336. The 95% confidence interval extends from 85.5668 
- 8.0336 = $77.53 to 85.5668 + 8.0336 = $93.60. The 95% prediction interval for an individual is 
given by \(b_0 + b_1 x_0 \pm t_{\alpha/2}\, S \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}\). The point estimate for the individual is 85.5668. 
The margin of error is 

\[ \pm\, 2.101 \times 16.7382 \times \sqrt{1 + \frac{1}{20} + \frac{(40 - 36.655)^2}{5120.2195}} \]

or ± 36.0729. The 95% 
prediction interval extends from 85.5668 - 36.0729 = $49.49 to 85.5668 + 36.0729 = $121.64. 


13.48 Locate the 95% confidence interval and the 95% prediction interval in Fig. 13-7 and compare with the 
results in problem 13.47. 


Ans. Fit St Dev Fit 95.0% CI 95.0% PI 
85.57 3.82 (77.53, 93.60) (49.50, 121.64) 
The 95% confidence interval and the 95% prediction intervals are seen to be the same as those 
found in problem 13.47. 


LINEAR CORRELATION COEFFICIENT 


13.49 Calculate the correlation coefficient between the household income and the cost of filing federal taxes 
using the data in Table 13.12. 


Ans. The correlation coefficient is given by \(r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{11773.8375}{\sqrt{5120.2195 \times 32116.6875}} = 0.918\). 


13.50 Using the output shown in Fig. 13-7, determine the correlation coefficient for the data given in Table 
13.12. 


Ans. S = 16.73 R-Sq = 84.3% R-Sq(adj) = 83.4% 
Since \(r^2 = .843\), \(r = \sqrt{.843} = 0.918\). The positive root is taken because the slope of the estimated regression line is positive. 


INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT 


13.51 Use the computed value for r in problem 13.49 to test the hypothesis \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\) at level of 
significance \(\alpha = .05\), where \(\rho\) represents the correlation between household income and the cost of 
filing federal taxes in the population. 


Ans. The test statistic is computed as \(t^* = r\sqrt{\dfrac{n-2}{1-r^2}} = .918\sqrt{\dfrac{20-2}{1 - .842724}} = 9.82\). The critical values are 
± 2.101. The null hypothesis is rejected at \(\alpha = .05\). 


13.52 How can the Minitab output in Fig. 13-7 be used to test the hypothesis \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\) at level of 
significance \(\alpha = .05\), where \(\rho\) represents the correlation between household income and the cost of 
filing federal taxes in the population? 


Ans. The hypothesis \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\) is equivalent to the hypothesis \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). 
The following portion of the output is used to test \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). 


Predictor Coef St Dev T P 
Constant -6.420 9.353 -0.69 0.501 
Income 2.2997 0.2339 9.83 0.000 


The p value = 0.000 indicates that the null hypothesis is rejected. 


Chapter 14 


Nonparametric Statistics 


NONPARAMETRIC METHODS 


Many of the hypothesis tests discussed in previous chapters required various assumptions, such as 
normality of the characteristic being measured or equality of population variances in two sample tests. 
What can we do if the assumptions are not satisfied? In addition, many research studies 
involve low-level data such as nominal or ordinal data to which previously discussed procedures do 
not apply. In a taste test, the response may be Pepsi or Coke when asked which of two colas is 
preferred. In situations such as these, nonparametric statistical methods are often used. 

Nonparametric tests often replace the raw data with ranks, signs, or both ranks and signs. The 
nonparametric tests then analyze the resulting ranks and/or signs. Since the original data are not 
actually used, one criticism of these procedures is that they are wasteful of information. 


EXAMPLE 14.1 Table 14.1 gives a set of data and the corresponding ranks associated with the data values. 
The smallest number receives a rank of 1, the next a rank of 2, and so on. If two or more values are tied, they 
receive the average of the ranks that would normally occur. Note that 13.5 occurred 3 times and the ranks 
would have been 3, 4, and 5. The average of 3, 4, and 5 is 4 and so 4 was assigned to each occurrence of the 
value 13.5. The sum of the ranks 1 through n is always equal to \(\dfrac{n(n+1)}{2}\). If n = 10, then the sum of the ranks is 
\(\dfrac{10 \times 11}{2} = 55\). This sum is obtained even if ties occur in the data. If the ranks are added in Table 14.1, the sum 
55 is obtained. The replacement of data with ranks will be utilized in many of the following sections. 


Table 14.1 


EValue | 17.2 [13.5 | 15.5 | 12.5 | 135 [160 | 1s | 13.5 | 14.3 [18.0 | 
Ranke 9 4 810 


EXAMPLE 14.2 In a taste test, each individual was asked to taste a pizza with a thick crust and a pizza with a 
thin crust, and to state which one they preferred. A coin was flipped to determine which one they tasted first in 
order to eliminate the order of tasting as a factor. Table 14.2 gives the response for each individual and each 
response is coded as + if the response was thick and - if the response was thin. The results seem to indicate that 
the thin crust is preferred over the thick crust. But are these sample results significant? That is, what do these 
results tell us about the set of preferences in the population? Seventy percent of the sample preferred thin over 
thick crust. What are the chances of these results if in fact there is no difference in preference in the population? 
The sign test in the next section will help us answer this question. 


Table 14.2 


Response   thin   thick   thin   thick   thin   thin   thin   thick   thin   thin 
Sign        -      +       -      +       -      -      -      +       -      - 


SIGN TEST 


The sign test is one of the simplest and easiest to understand of the nonparametric tests. In 
Example 14.2, the purpose of the taste test is to decide if there is a difference in taste preference 
between thin and thick pizza crust. The null hypothesis is that there is no difference in preference for 
the two types of crusts. The research hypothesis might be one-tail or two-tail. Suppose it is a two-tail 
research hypothesis. As usual, we proceed under the assumption that the null hypothesis is true and 
only reject that assumption if our sample results lead us to reject it. If there is no difference in 
preference, then the probability of a + on any trial is .5 and the probability of a - is also .5. The 
number of + signs in the 10 trials, x, is a binomial random variable. The p value associated with the 
outcome x = 3 is computed by finding \(P(x \le 3)\) and then doubling the result since this is a two-tail 
test. Using the binomial distribution tables in Appendix 1 with n = 10 and p = .5, we find that \(P(x \le 3)\) 
is equal to .0010 + .0098 + .0439 + .1172 = .1719 and the p value = 2 × .1719 = 0.3438. At the 
conventional level of significance \(\alpha = .05\) we cannot reject the null hypothesis since the p value is not 
less than the preselected level of significance. We see from this discussion that the sign test is based on the 
binomial distribution. The p value may be computed by using Minitab. The 
computation using Minitab is as follows: 


MTB > cdf 3; 
SUBC> binomial n = 10 p = .5. 


Cumulative Distribution Function 


Binomial with n = 10 and p = 0.500000 


x P(x <= x) 
3.00 0.1719 


The p value is then found by doubling 0.1719 to get 0.3438 as found above. 
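
The same binomial calculation can be sketched outside Minitab. The Python code below is an illustration only; the function names and the `alternative` argument are assumptions of this sketch, and it requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for a binomial random variable with n trials and success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def sign_test_p_value(plus_count, n, alternative="two-sided"):
    """p value of the sign test given the number of + signs among n usable observations."""
    lower = binomial_cdf(plus_count, n)              # P(X <= x)
    upper = 1 - binomial_cdf(plus_count - 1, n)      # P(X >= x)
    if alternative == "less":
        return lower
    if alternative == "greater":
        return upper
    return min(1.0, 2 * min(lower, upper))           # double the smaller tail

# Example 14.2: 3 plus signs in 10 trials, two-tail research hypothesis.
print(round(binomial_cdf(3, 10), 4))
print(round(sign_test_p_value(3, 10), 4))
```

Doubling the smaller tail matches the hand computation above, and capping the result at 1 guards against the case where the count sits at the center of the distribution.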


EXAMPLE 14.3 In order to test the hypothesis that the median price for a home in a city is $105,000, a 
sample of 10 recently sold homes is obtained. If the selling price exceeds $105,000, a plus sign is recorded. If the 
price is less than $105,000, a negative sign is recorded. If the price is equal to $105,000, the selling price is not 
used in the analysis and the sample size is reduced. Table 14.3 gives the selling prices in thousands and the 
signs are recorded as described. 


Table 14.3 

Price   110   130   102   125   140   112   117   130   200   119 
Sign     +     +     -     +     +     +     +     +     +     + 


Suppose the alternative hypothesis is that the median selling price exceeds $105,000. If x represents the number 
of + signs, then the p value = \(P(x \ge 9)\). The computation of the p value is as follows. 


MTB > cdf 8; 
SUBC> binomial n = 10 p = .5. 


Cumulative Distribution Function 


Binomial with n = 10 and p = 0.500000 


x P(x <= x) 
8.00 0.9893 
Since \(P(x \ge 9) = 1 - P(x \le 8) = 1 - .9893 = 0.0107\), the p value is 0.0107. The null hypothesis is rejected and we 
conclude that the median price exceeds $105,000 at level of significance \(\alpha = .05\). The normal approximation to 
the binomial distribution may also be used to determine the p value. It is recommended that the student review 
this approximation, which is discussed in Chapter 6. The approximation is valid provided that \(np \ge 5\) and \(nq \ge 5\); 
np and nq both equal 5 in this example. The mean is computed as \(\mu = np = 10 \times .5 = 5\) and the standard 
deviation is computed as \(\sigma = \sqrt{npq} = \sqrt{10 \times .5 \times .5} = 1.581\). Using the continuity correction, the z value is computed as \(z = \dfrac{8.5 - 5}{1.581} = 2.21\). 
The p value is the area to the right of 2.21 under the standard normal curve: \(P(z > 2.21) = .5 - .4864 = .0136\). 
The normal approximation to the binomial distribution provides a reasonably good result if n is 10 or greater. 
Many books recommend that n be 20 or 25, but if the continuity correction is used the result will be fairly good 
for n equal to 10 or more. 
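
The normal approximation with the continuity correction can be sketched in Python as follows. The helper name `normal_upper_tail` is an assumption of this sketch; it uses the complementary error function from the standard library in place of the normal table.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal random variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p = 10, 0.5
mu = n * p                               # mean of the binomial count
sigma = math.sqrt(n * p * (1 - p))       # standard deviation of the binomial count
z = (8.5 - mu) / sigma                   # continuity correction: P(x >= 9) is approximated by P(X > 8.5)
p_value = normal_upper_tail(z)
print(round(sigma, 3), round(z, 2), round(p_value, 4))
```

Because z is not rounded to two decimals before the tail area is found, the result differs slightly from the table-based .0136; both are close to the exact binomial value 0.0107.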


EXAMPLE 14.4 The sign test described in Example 14.3 may be performed by using the sign test procedure 
in Minitab. The prices given in Table 14.3 are entered in column C1 and the following commands are used. The 
subcommand Alternative 1. indicates that the alternative is upper tailed. 


MTB > STest 105 C1; 
SUBC> Alternative 1. 


Sign Test for Median 


Sign test of median = 105.0 vs. > 105.0 


        N   Below   Equal   Above        P   Median 
C1     10       1       0       9   0.0107    122.0 


Note that the same p value is given here as was found in Example 14.3. 


EXAMPLE 14.5 The sign test is also used to test for differences in matched paired experiments. Table 14.4 
gives the blood pressure readings before and six weeks after starting hypertensive medication as well as the 
differences in the readings. 


Table 14.4 
Patient           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
Before reading   95  90  98  99  87  95  95  97  90  90  99  95  95  92  96 
After reading    85  80  91  88  90  90  88  84  93  91  90  90  97  94  86 
Difference       10  10   7  11  -3   5   7  13  -3  -1   9   5  -2  -2  10 


The research hypothesis is that the medication will reduce the blood pressure. The difference is found by 
subtracting the after reading from the before reading. If the medication is effective, the majority of the 
differences will be positive. If x represents the number of positive signs in the 15 differences and if the null 
hypothesis that the medication has no effect is true, then x will have a binomial distribution with n= 15 and p = 
.5. The null hypothesis should be rejected for large values of x. That is, as this study is described, this is an 
upper-tail test. The Minitab solution is as follows. If the medication is not effective, the median difference 
should be zero. The differences are entered into column C1. 


10 10 7 11 -3 5 7 13 -3 -1 9 5 
-2 -2 10 


MTB > STest median = 0.0 data in C1; 
SUBC> Alternative 1. 


Sign Test for Median 
Sign test of median = 0.00000 vs. > 0.00000 

N Below Equal Above P Median 
cl 15 5 0 10 0.1509 7.000 


The p value is seen to equal 0.1509. This indicates that the null hypothesis would not be rejected at level of 
significance o& = .05. A dotplot of the differences is shown in Fig. 14-1. The dotplot illustrates a weakness of the 
sign test. Note that the 5 minus values are much smaller in absolute value than are the absolute values of the 10 
positive differences. The sign test does not utilize this additional information. The signed-rank test discussed in 
the next section uses this additional information. 


[Dotplot of the 15 differences; horizontal axis marked at -3.0, 0.0, 3.0, 6.0, 9.0, 12.0]


Fig. 14-1
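
The p value reported by Minitab in Example 14.5 can be checked directly from the binomial distribution. The following is a minimal Python sketch, not part of the Minitab session; the function name sign_test_upper_p is an assumption introduced here for illustration. It counts the positive differences and sums the binomial probabilities for that count or more.

```python
from math import comb

def sign_test_upper_p(n_plus: int, n: int) -> float:
    """Upper-tail p value: P(X >= n_plus) when X is binomial with n trials and p = .5."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n

# Differences (before - after) from Table 14.4
differences = [10, 10, 7, 11, -3, 5, 7, 13, -3, -1, 9, 5, -2, -2, 10]

# Zero differences carry no sign and are dropped (there are none in this data set)
nonzero = [d for d in differences if d != 0]
n_plus = sum(d > 0 for d in nonzero)

p_value = sign_test_upper_p(n_plus, len(nonzero))
print(n_plus, round(p_value, 4))
```

With 10 positive signs among 15 differences, the result should agree with the p value 0.1509 in the Minitab output.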

The sign test is a nonparametric alternative to the one sample t test. The one sample t test 
assumes that the characteristic being measured has a normal distribution. The sign test does not 
require this assumption. If the selling prices of the homes discussed in Example 14.3 are normally 
distributed, then the one sample t test could be used to test the null hypothesis that the mean is 
$105,000. 


WILCOXON SIGNED-RANK TEST FOR TWO DEPENDENT SAMPLES 


In Example 14.5, it was noted that the sign test did not make use of the information that the 10 
positive differences were all larger in absolute value than the absolute value of any of the 5 negative 
differences. The sign test used only the information that there were 10 pluses and 5 negatives. The 
data in Table 14.4 will be used to describe the Wilcoxon signed-rank test. Table 14.5 gives the 
computations needed for the Wilcoxon signed-rank test. 


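Table 14.5


Before reading   After reading   Difference, D   |D|   Rank of |D|   Signed rank
      95               85              10         10       12           12
      90               80              10         10       12           12
      98               91               7          7        8.5          8.5
      99               88              11         11       14           14
      87               90              -3          3        4.5         -4.5
      95               90               5          5        6.5          6.5
      95               88               7          7        8.5          8.5
      97               84              13         13       15           15
      90               93              -3          3        4.5         -4.5
      90               91              -1          1        1           -1
      99               90               9          9       10           10
      95               90               5          5        6.5          6.5
      95               97              -2          2        2.5         -2.5
      92               94              -2          2        2.5         -2.5
      96               86              10         10       12           12
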

The first column gives the before blood pressure readings, the second column gives the after
readings, the third column gives the difference D = Before - After, the fourth column gives the
absolute values of the differences, the fifth column gives the ranks of the absolute differences, and the
sixth column gives the signed rank, which restores the sign of the difference to the rank of the
absolute difference. Tied absolute differences receive the average of the ranks they occupy. Two sums
of ranks are defined as follows: W+ = sum of positive ranks and W- = absolute value of sum of negative
ranks. For the data in Table 14.5, we have the following:


W+ = 12 + 12 + 8.5 + 14 + 6.5 + 8.5 + 15 + 10 + 6.5 + 12 = 105
W- = 4.5 + 4.5 + 1 + 2.5 + 2.5 = 15


To help in understanding the logic behind the signed-rank test, it is helpful to note some interesting
facts concerning the above data. The sum of the ranks of the absolute differences must equal the sum
of the first 15 positive integers. Using the formula given in Example 14.1, the sum of the ranks 1
through 15 is 15(16)/2 = 120. Note that W+ + W- = 120. If the null hypothesis is true and the
medication has no effect, we would expect a 60, 60 split in the two sums of ranks. That is, if H0 is true,
we would expect to find W+ = 60 and W- = 60. If the medication is effective for every patient, we
would expect W+ = 120 and W- = 0. Either of the rank sums, W+ or W-, may be used to perform the
hypothesis test. Tables for the Wilcoxon test statistic are available for determining critical values for
the Wilcoxon signed-rank test. However, statistical software would most often be used to perform the
test in real world applications. The Minitab solution is given in Example 14.6.
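
The ranking steps behind Table 14.5 can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names average_ranks and signed_rank_sums are assumptions introduced here and are not Minitab commands. Tied absolute differences receive the average of the ranks they occupy, as in the table.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]

def signed_rank_sums(differences):
    """Return (W+, W-) for a list of paired differences, ignoring zero differences."""
    nonzero = [d for d in differences if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    return w_plus, w_minus

# Differences (before - after) from Table 14.5
differences = [10, 10, 7, 11, -3, 5, 7, 13, -3, -1, 9, 5, -2, -2, 10]
w_plus, w_minus = signed_rank_sums(differences)
print(w_plus, w_minus)
```

The two sums should reproduce W+ = 105 and W- = 15, and they add to 15(16)/2 = 120.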


EXAMPLE 14.6 The differences in column 3 of Table 14.5 are entered into column C1. The command
WTest 0.0 C1; indicates that the Wilcoxon signed-rank test is to be used to test that the median of the
differences is 0 and that the differences are in column C1. The subcommand Alternative 1. indicates that
the test is upper-tailed. The output gives the Wilcoxon statistic, W+, computed in the above discussion and, most
importantly, it gives the p value for the test. The probability of obtaining a value of W+ equal to 105 or larger is
0.006. This p value is much less than .05 and indicates that the medication is effective. Note that the p value
obtained when using the sign test in Example 14.5 is 0.1509. The Wilcoxon signed-rank test is more powerful
than the sign test, as indicated by this example.


Diff
10 10 7 11 -3 5 7 13 -3 -1
9 5 -2 -2 10


MTB > WTest 0.0 C1;
SUBC> Alternative 1.


Wilcoxon Signed Rank Test 


Test of median = 0.000000 vs. median > 0.000000 


N for Wilcoxon Estimated 
N Test Statistic P Median 
Diff 15 15 105.0 0.006 5.000 


A normal approximation procedure is also used for the Wilcoxon signed-rank test
when the sample size is 15 or more. The test statistic is given by formula (14.1), where W is either
W+ or W- depending on the nature of the alternative hypothesis, and n is the number of nonzero
differences.


$$Z = \frac{W - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} \qquad (14.1)$$


EXAMPLE 14.7 The normal approximation will be applied to the blood pressure data in Table 14.5 and the
results compared to the results in Example 14.6. The computed test statistic is as follows:


$$Z^* = \frac{105 - 15(16)/4}{\sqrt{15(16)(31)/24}} = 2.56$$


The p value is P(z > 2.56) = .5 - .4948 = 0.0052. The p value in Example 14.6 is equal to 0.006. Note that the
two values are very close. The difference may be due to round-off error or it may be due to the Minitab software
using the exact distribution for the test statistic rather than the normal approximation.
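
The arithmetic in formula (14.1) can be reproduced with the Python standard library. The sketch below is illustrative; the function names normal_cdf and signed_rank_z are assumptions introduced here. The standard normal area is obtained from the error function rather than from a printed table, so the p value may differ from the table-based value in the last digit.

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def signed_rank_z(w: float, n: int) -> float:
    """Formula (14.1): standardize a signed-rank sum W based on n nonzero differences."""
    mean = n * (n + 1) / 4
    std_dev = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w - mean) / std_dev

z_star = signed_rank_z(105, 15)      # W+ = 105 from Table 14.5
p_upper = 1 - normal_cdf(z_star)     # upper-tail p value
print(round(z_star, 2), round(p_upper, 4))
```

Using W- = 15 in place of W+ gives the same magnitude with the opposite sign, which is why either rank sum may be used.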


Like the sign test, the Wilcoxon signed-rank test is a nonparametric alternative to the one sample 
t test. However, the Wilcoxon signed-rank test is usually more powerful than the sign test. There are 
situations such as taste preference tests where the sign test is applicable but the Wilcoxon signed-rank 
test is not. 


WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES 


Consider an experiment designed to compare the lifetimes of rats restricted to two different diets. 
Eight rats were randomly divided into two groups of four each. The rats in the control group were 
allowed to consume all the food they desired and the rats in the experimental group were allowed 
only 80% of the food that they normally consumed. The experiment was continued until all rats 
expired and their lifetimes are given in Table 14.6. 


Table 14.6 


A nonparametric procedure for comparing the two groups was developed by Wilcoxon. Mann 
and Whitney developed an equivalent but slightly different nonparametric procedure. Some books 
discuss the Wilcoxon rank-sum test and some books discuss the Mann-Whitney U test. Both 
procedures make use of a ranking procedure we will now describe. Imagine that the data values are 
combined together and ranked. Table 14.7 gives the same data as shown in Table 14.6 along with the 
combined rankings shown in parentheses. 


Table 14.7 


4.2 (6.5 4.2 (6.5 


We let R1 be the sum of ranks for group 1, the control group, and R2 be the sum of ranks for
group 2. For the data in Table 14.7, we have R1 = 11 and R2 = 25. The sum of all eight ranks must
equal 8(9)/2 = 36. Note that R1 + R2 = 36.


Now, consider some hypothetical situations. Suppose there is no difference in the lifetimes of the
control and experimental diets. We would expect R1 = 18 and R2 = 18. Suppose the experimental diet
is highly superior to the control diet. Then we might expect all rats in the experimental group to be
alive after all the rats in the control group have died, and as a result obtain R1 = 10 and R2 = 26. Critical
values for testing the hypothesis of no difference in the two groups may be obtained from a table
of the Mann-Whitney statistic or a table of the Wilcoxon rank-sum statistic. These tables are available
in most elementary statistics texts. Most users of statistics will use a computer software package to
evaluate their results. Example 14.8 illustrates the use of Minitab to evaluate the results in the
experiment described above.


EXAMPLE 14.8 The Minitab analysis for the data in Table 14.6 is shown below. Note that the data are stored
in separate columns. The command Mann-Whitney 95.0 C1 C2; will perform a Mann-Whitney test on
the data in columns C1 and C2. The subcommand Alternative -1. will test the alternative that the median
lifetime for the control group is less than the median lifetime for the experimental group. The output line W =
11.0 is the value for R1 computed in the above discussion of this experiment. The output line Test of ETA1
= ETA2 vs. ETA1 < ETA2 is significant at 0.0303 gives the p value as 0.0303 for the
lower-tail alternative hypothesis. At level of significance α = .05, the null hypothesis would be rejected since
the p value < α. This rather small study seems to indicate that restricting the diet prolongs the lifetime of such
rats.


Row Control Expermtl 


mWN er 
NO of Wo Wo 
None 
Dm BP Wo 
Men ~J 


MTB > Mann-Whitney 95.0 C1 C2;
SUBC> Alternative -1.


Mann-Whitney Confidence Interval and Test


Control    N = 4    Median = 3.500
Expermtl   N = 4    Median = 4.200
Point estimate for ETA1-ETA2 is -0.700
97.0 Percent CI for ETA1-ETA2 is (-2.000,0.300)
W = 11.0
Test of ETA1 = ETA2 vs. ETA1 < ETA2 is significant at 0.0303
The test is significant at 0.0295 (adjusted for ties)


A normal approximation procedure is also used for the Wilcoxon rank-sum test when the
two sample sizes are large. If both sample sizes, n1 and n2, are 10 or more, the approximation is
usually good. The test statistic is given by formula (14.2), where R1 is the rank sum associated with
sample size n1.


$$Z = \frac{R_1 - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} \qquad (14.2)$$


EXAMPLE 14.9 A test instrument which measures the level of type A personality behavior was administered
to a group of lawyers and a group of doctors. The results and the combined rankings are given in Table 14.8.
The higher the score on the instrument, the higher the level of type A behavior. The null hypothesis is that the
medians are equal for the two groups, and the alternative hypothesis is that the medians are unequal.


Table 14.8 


We note that n1 = 12, n2 = 10, and R1 = 184. The computed value of the test statistic is found as follows:


$$Z^* = \frac{184 - 12(12 + 10 + 1)/2}{\sqrt{12 \times 10 \times (12 + 10 + 1)/12}} = 3.03$$


The null hypothesis is rejected for level of significance α = .05, since the critical values are ±1.96 and the
computed value of the test statistic exceeds 1.96. The median score for lawyers exceeds the median score for
doctors.


In closing this section, we note that the parametric equivalent test for the Wilcoxon rank-sum test 
and the Mann-Whitney U test is the two-sample t test discussed in Chapter 10. 


KRUSKAL-WALLIS TEST 


In Chapter 12, the one-way ANOVA was discussed as a statistical procedure for comparing 
several populations in terms of means. The procedure assumed that the samples were obtained from 
normally distributed populations with equal variances. The Kruskal-Wallis test is a nonparametric 
statistical method for comparing several populations. Consider a study designed to compare the net 
worth of three different groups of senior citizens (ages 51-61). The net worth (in thousands) for each 
senior citizen in the study is shown in Table 14.9. The null hypothesis is that the distributions of net 
worths are the same for the three groups and the alternative hypothesis is that the distributions are 
different. 


Table 14.9
250   80   70
175  150  100
130   70  130
205   90   75
225  110   95


The data are combined into one group of 15 numbers and ranked. The rankings are shown in Table 
14.10. The sums of the ranks for each of the three samples are shown as R1, R2, and R3. 


Table 14.10 

Hispanics 

15    4    1.5
12   11    7
9.5   1.5  9.5
13    5    3
14    8    6
R1 = 63.5   R2 = 29.5   R3 = 27


The sum of the ranks from 1 to 15 is equal to 120. If the null hypothesis were true, we would expect
the sum of ranks for each of the three groups to equal 40. That is, if H0 is true we would expect that
R1 = R2 = R3 = 40. The Kruskal-Wallis test statistic measures the extent to which the rank sums for
the samples differ from what would be expected if the null were true. The Kruskal-Wallis test
statistic is given by formula (14.3), where k = the number of populations, n_i = the number of items in
sample i, R_i = sum of ranks for sample i, and n = the total number of items in all samples.


$$W = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) \qquad (14.3)$$


W has an approximate Chi-square distribution with df = k - 1 when all sample sizes are 5 or more.


EXAMPLE 14.10 The computed value for W for the data in Table 14.9 is determined as follows:


$$W^* = \frac{12}{15 \times 16}\left[\frac{63.5^2}{5} + \frac{29.5^2}{5} + \frac{27^2}{5}\right] - 3 \times 16 = 8.315$$


The critical value from the Chi-square table with df = 2 and α = .05 is 5.991. We conclude that the population
distributions are not the same, since W* exceeds the critical value. Suppose the sample rank sums were each the
same. That is, suppose R1 = R2 = R3 = 40. The computed value for W would then equal the following:


$$W^* = \frac{12}{15 \times 16}\left[\frac{40^2}{5} + \frac{40^2}{5} + \frac{40^2}{5}\right] - 3 \times 16 = 0$$


It is clear that this is always a one-tailed test. That is, the null is rejected only for large values of the computed
test statistic.
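
The computation in Example 14.10 can be sketched in Python as follows. This is an illustrative sketch, not the Minitab procedure; the function names average_ranks and kruskal_wallis are assumptions introduced here, and the statistic is the version in formula (14.3) without any adjustment for ties.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]

def kruskal_wallis(groups):
    """Return the rank sums and the statistic W of formula (14.3) (no tie adjustment)."""
    combined = [x for group in groups for x in group]
    ranks = average_ranks(combined)
    n = len(combined)
    rank_sums, start = [], 0
    for group in groups:
        rank_sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)
    w = 12 / (n * (n + 1)) * sum(
        r ** 2 / len(group) for r, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return rank_sums, w

# Net worth (in thousands) for the three groups, as listed in the Minitab data of Example 14.11
group1 = [250, 175, 130, 205, 225]
group2 = [80, 150, 70, 90, 110]
group3 = [70, 100, 130, 75, 95]

rank_sums, w_star = kruskal_wallis([group1, group2, group3])
print(rank_sums, round(w_star, 3))
```

The rank sums should add to 120, and the statistic is compared with the Chi-square critical value for df = 2.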

EXAMPLE 14.11 The Minitab analysis for the data in Table 14.9 is given below. The value for H is the same
as that for W* found in Example 14.10. The p value is 0.016, indicating that the null hypothesis would be
rejected for α = .05, the same conclusion reached in Example 14.10.


Row  Group  Networth
  1      1       250
  2      1       175
  3      1       130
  4      1       205
  5      1       225
  6      2        80
  7      2       150
  8      2        70
  9      2        90
 10      2       110
 11      3        70
 12      3       100
 13      3       130
 14      3        75
 15      3        95


MTB > Kruskal-Wallis 'Networth' 'Group'.


Kruskal-Wallis Test


Kruskal-Wallis Test on Networth


Group      N   Median   Ave Rank       Z
1          5   205.00       12.7    2.88
2          5    90.00        5.9   -1.29
3          5    95.00        5.4   -1.59
Overall   15                 8.0


H = 8.31 DF = 2 P = 0.016


RANK CORRELATION 


The Pearson correlation coefficient was introduced in Chapter 13 as a measure of the strength of
the linear relationship of two variables. The Pearson correlation coefficient is defined by formula
(14.4). The Spearman rank correlation coefficient is computed by replacing the observations for x
and y by their ranks and then applying formula (14.4) to the ranks. The Spearman rank correlation
coefficient is represented by r_s.


$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \qquad (14.4)$$

Table 14.11 


Basic Properties of the Spearman Rank Correlation Coefficient 


1. The value of r_s is always between -1 and +1, i.e., -1 ≤ r_s ≤ +1.


2. A value of r_s near +1 indicates a strong positive association between the rankings.
A value of r_s near -1 indicates a strong negative association between the rankings.


3. The Spearman rank correlation coefficient may be applied to data at the ordinal 
level of measurement or above. 


EXAMPLE 14.12 Table 14.12 gives the number of tornadoes and number of deaths due to tornadoes per 
month for a sample of 10 months. In addition, the ranks are shown for the values for each variable. 

Table 14.12


Number of tornadoes, x | Rank of x | Number of deaths, y | Rank of y

oo 


CBmMmomewnnrnast — 


Jonhb BwWOo—- WN 


— 


The formula for the Pearson correlation coefficient given in formula (14.4) is applied to the ranks of x and the
ranks of y.


Σx = 1 + 4 + 6 + ··· + 10 = 55, Σx² = 1 + 16 + 36 + ··· + 100 = 385, Σy = 55, Σy² = 4 + 4 + 25 + ··· + 100 =
382.5, Σxy = 1 × 2 + 4 × 2 + 6 × 5 + ··· + 10 × 10 = 363.5.


$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 385 - 302.5 = 82.5, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 382.5 - 302.5 = 80,$$

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 363.5 - 302.5 = 61.$$


$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{61}{\sqrt{82.5 \times 80}} = 0.75$$


EXAMPLE 14.13 The Minitab solution for Example 14.12 is shown below. The raw data are entered into 
columns cl and c2. The rank command is then used to put the ranks into columns c3 and c4. Then the Pearson 
correlation coefficient is requested for columns c3 and c4. 


tornado deaths 
35 ) 
85 
90 
94 
50 
76 
98 
104 
88 

10 128 


R 


= 


WODMANHAUFWNEFRO 


SOP BWOrRN CO 


MTB > rank c1 put into c3
MTB > rank c2 put into c4
MTB > print c3 c4 


C3 c4 
1 


R 


= 


WWDWATINHMNBPWNHr O 
WIDINN HUD 
COMUMNDOOOOCO 


6 
7 
2 
| 
8 
9 
5 
10 10 10. 


MTB > correlation c3 c4 
Correlation of c3 and c4 = 0.739 

The population Spearman rank correlation coefficient is represented by ρ_s. In order to test the
null hypothesis that ρ_s = 0, we use the result that √(n - 1) r_s has an approximate standard normal
distribution when the null hypothesis is true and n ≥ 10. Some books suggest a larger value of n in
order to use the normal approximation.


EXAMPLE 14.14 The data in Table 14.12 may be used to test the null hypothesis H0: ρ_s = 0 vs. Ha: ρ_s ≠ 0 at
level of significance α = .05. The critical values are ±1.96. The computed value of the test statistic is found as
follows: z* = √(10 - 1) × .75 = 2.25. We conclude that there is a positive association between the number of
tornadoes per month and the number of tornado deaths per month since z* > 1.96.
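
The calculations in Examples 14.12 and 14.14 can be checked from the rank sums alone. The Python sketch below is illustrative; the function name correlation_from_sums is an assumption introduced here. It applies formula (14.4) to the sums of the ranks quoted in Example 14.12 and then forms the test statistic √(n - 1) r_s.

```python
from math import sqrt

def correlation_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Formula (14.4) evaluated from the five sums; applied to ranks it gives r_s."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Sums of the ranks of x and y quoted in Example 14.12 (n = 10 months)
n = 10
r_s = correlation_from_sums(n, 55, 55, 385, 382.5, 363.5)

# Large-sample test statistic for H0: rho_s = 0
z_star = sqrt(n - 1) * r_s
print(round(r_s, 2), round(z_star, 2))
```

The same function applies to problem 14.11 by substituting n = 20 and the sums given there.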


RUNS TEST FOR RANDOMNESS 


Suppose 3.5-ounce bars of soap are being sampled and their true weights determined. If the 
weight is above 3.5 ounces, then the letter A is recorded. If the weight is below 3.5 ounces, then a B 
is recorded. Suppose that in the last 10 bars the result was AAAAABBBBB. This would indicate 
nonrandomness in the variation about the 3.5 ounces. A result such as ABABABABAB would also 
indicate nonrandomness. The first outcome is said to have R = 2 runs. The second outcome is said to 
have R = 10 runs. There are n1 = 5 A’s and n2 = 5 B’s. A run is a sequence of the same letter written 
one or more times. There is a different letter (or no letter) before and after the sequence. A very large 
or very small number of runs would imply nonrandomness. 

In general, suppose there are n1 symbols of one type and n2 symbols of a second type and R runs
in the sequence of n1 + n2 symbols. When the occurrence of the symbols is random, the mean value
of R is given by formula (14.5):


$$\mu = \frac{2 n_1 n_2}{n_1 + n_2} + 1 \qquad (14.5)$$


When the occurrence of the symbols is random, the standard deviation of R is given by formula
(14.6):


$$\sigma = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} \qquad (14.6)$$


When both n1 and n2 are greater than 10, R is approximately normally distributed. The size
requirements for n1 and n2 in order that R be normally distributed vary from book to book. We shall
use the requirement that both exceed 10. From the preceding discussion, we conclude that when the
occurrence of the symbols is random and both n1 and n2 exceed 10, then the expression in formula
(14.7) has a standard normal distribution.


$$Z = \frac{R - \mu}{\sigma} \qquad (14.7)$$


When the normal approximation is not appropriate, several Statistics books give a table of critical 
values for the runs test. 


EXAMPLE 14.15 Bars of soap are selected from the production line and the letter A is recorded if the weight 
of the bar exceeds 3.5 ounces and B is recorded if the weight is below 3.5 ounces. The following sequence was 
obtained. 


AAABBAAAABBABABAAABBBBAABBBABABBAA 

These results are used to test the null hypothesis that the occurrence of A’s and B’s is random vs. the alternative
hypothesis that the occurrences are not random. The level of significance is α = .01. The critical values are
±2.58. We note that n1 = the number of A’s = 18 and n2 = the number of B’s = 16. The number of runs is R =
17. Assuming the null hypothesis is true, the mean number of runs is


$$\mu = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2 \times 18 \times 16}{34} + 1 = 17.9412$$


Assuming the null hypothesis is true, the standard deviation of the number of runs is


$$\sigma = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{576 (576 - 18 - 16)}{34^2 \times 33}} = 2.8607$$


The computed value of the test statistic is


$$Z^* = \frac{R - \mu}{\sigma} = \frac{17 - 17.9412}{2.8607} = -.33$$


Since Z* lies between the critical values, there is no reason to doubt the randomness of the observations. The
process seems to be varying randomly about the 3.5 ounces, the target value for the bars of soap.
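
The runs count and formulas (14.5) through (14.7) can be sketched in Python for the sequence in Example 14.15. This is an illustrative sketch; the function name runs_test is an assumption introduced here, and the sequence is assumed to contain exactly two distinct symbols.

```python
from math import sqrt

def runs_test(sequence):
    """Return n1, n2, the number of runs R, and the mean, standard deviation, and Z for R.

    Assumes the sequence contains exactly two distinct symbols.
    """
    first, second = sorted(set(sequence))
    n1, n2 = sequence.count(first), sequence.count(second)
    # A new run starts wherever a symbol differs from the one before it
    runs = 1 + sum(a != b for a, b in zip(sequence, sequence[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1                                    # formula (14.5)
    sigma = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                 / ((n1 + n2) ** 2 * (n1 + n2 - 1)))                    # formula (14.6)
    return n1, n2, runs, mu, sigma, (runs - mu) / sigma                 # formula (14.7)

soap = "AAABBAAAABBABABAAABBBBAABBBABABBAA"
n1, n2, runs, mu, sigma, z_star = runs_test(soap)
print(n1, n2, runs, round(mu, 4), round(sigma, 4), round(z_star, 2))
```

The two extreme sequences discussed earlier, AAAAABBBBB and ABABABABAB, give R = 2 and R = 10 with the same function.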


EXAMPLE 14.16 The Minitab solution to Example 14.15 is shown below. The symbols A and B are coded
as 1 and -1 respectively. The 0 in the statement Runs 0 C1. is the mean of -1 and 1. The output gives the
observed number of runs, the mean number of runs, the number of A’s and B’s, and the p value = 0.7422. The
output is the same as that obtained in Example 14.15.


C1
 1  1  1 -1 -1  1  1  1  1 -1 -1  1 -1  1 -1
 1  1  1 -1 -1 -1 -1  1  1 -1 -1 -1  1 -1  1
-1 -1  1  1


MTB > Runs 0 C1.


Runs Test 
The observed number of runs = 17 
The expected number of runs = 17.9412 


18 Observations above K 16 below 
The test is significant at 0.7422 
Cannot reject at alpha = 0.05 


Solved Problems 


NONPARAMETRIC METHODS 


14.1. A table (not shown) gives the number of students per computer for each of the 50 states and the 
District of Columbia. If the data values in the table are replaced by their ranks, what is the sum 
of the ranks? 


Ans. The ranks will range from 1 to 51. The sum of the integers from 1 to 51 inclusive is 51(52)/2 =
1,326. This same sum is obtained even if some of the ranks are tied ranks.


14.2 A group of 30 individuals are asked to taste a low fat yogurt and one that is not low fat, and to 
indicate their preference. The participants are unaware which one is low fat and which one is 
not. A plus sign is recorded if the low fat yogurt is preferred and a minus sign is recorded if the 
one that is not low fat is preferred. Assuming there is no difference in taste for the two types of 
yogurt, what is the probability that all 30 prefer the low fat yogurt? 


Ans. Under the assumption that there is no difference in taste, the plus and minus signs are randomly
assigned. There are 2^30 = 1,073,741,824 different arrangements of the 30 plus and minus signs.
The probability that all 30 are plus signs is 1 divided by 1,073,741,824, or 9.313225746 × 10^-10. If
this outcome occurred, we would surely reject the hypothesis of no difference in the taste of the
two types of yogurt.

SIGN TEST 


14.3. Find the probability that 20 or more prefer the low fat yogurt in problem 14.2 if there is no 
difference in the taste. Use the normal approximation to the binomial distribution to determine 
your answer. 


Ans. Assuming that there is no difference in taste, the 30 responses represent 30 independent trials with
p = q = .5, where p is the probability of a + and q is the probability of a -. The mean number of +
signs is μ = np = 15 and the standard deviation is σ = √(npq) = 2.739. The z value corresponding
to 20 plus signs, using the continuity correction, is z = (19.5 - 15)/2.739 = 1.64. The probability of
20 or more plus signs is approximated by P(z > 1.64) = .5 - .4495 = .0505.
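
The approximation in problem 14.3 can be compared with the exact binomial probability. The Python sketch below is illustrative; the function name normal_cdf is an assumption introduced here. Because the normal area is computed from the unrounded z value rather than read from a table at z = 1.64, the approximate probability comes out slightly smaller than .0505.

```python
from math import comb, erf, sqrt

def normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 30, 0.5
mu = n * p                       # mean number of plus signs
sigma = sqrt(n * p * (1 - p))    # standard deviation of the number of plus signs

# Normal approximation with continuity correction for P(X >= 20)
z = (19.5 - mu) / sigma
p_approx = 1 - normal_cdf(z)

# Exact binomial probability of 20 or more plus signs
p_exact = sum(comb(n, k) for k in range(20, n + 1)) / 2 ** n

print(round(z, 2), round(p_approx, 4), round(p_exact, 4))
```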


14.4 A sociological study involving married couples, where both husband and wife worked full 
time, recorded the yearly income for each in thousands of dollars. The results are shown in 
Table 14.13. Use the sign test to test the research hypothesis that husbands have higher salaries. 


Use level of significance α = .05.


Table 14.13
| Husband    | 35 | 16 | 17 | 25 | 30 | 32 | 28 | 31 | 27 | 15 | 19 | 22 | 33 | 30 | 18 |
| Wife       | 25 | 20 | 18 | 20 | 25 | 25 | 25 | 19 | 30 | 20 | 15 | 20 | 27 | 28 | 20 |
| Difference | 10 | -4 | -1 |  5 |  5 |  7 |  3 | 12 | -3 | -5 |  4 |  2 |  6 |  2 | -2 |


Ans.


A high number of plus signs among the differences supports the research hypothesis. There are 10 
plus signs in the 15 differences. The p value is equal to the probability of 10 or more plus signs in 
the 15 differences. The p value is computed assuming the probability of a plus sign is .5 for any 
one of the differences, since the p value is computed assuming the null hypothesis is true. From 
the binomial distribution, we find that the p value = .0916 + .0417 + .0139 + .0032 + .0005 + 
.0000 = 0.1509. Since the p value exceeds the preset level of significance, we cannot reject the 
null hypothesis. 


WILCOXON SIGNED-RANK TEST FOR TWO DEPENDENT SAMPLES 


14.5 Use the Wilcoxon signed-rank test and the data in Table 14.13 to test the research hypothesis
that husbands have higher salaries. Determine the values for W+ and W- and then use the
normal approximation procedure to perform the test. Use level of significance α = .05.


Ans. 


From Table 14.14, we see that W+ = 14 + 10 + 10 + 13 + 5.5 + 15 + 7.5 + 3 + 12 + 3 = 93, and
W- = 7.5 + 1 + 5.5 + 10 + 3 = 27. Since Difference = Husband salary - Wife salary, large values
for W+ or small values for W- will lend support to the research hypothesis that husbands have
higher salaries. If W+ is used, the test will be upper-tailed. If W- is used, the test will be lower-tailed.
The test statistic used in the normal approximation is:


$$Z = \frac{W - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} = \frac{W - 60}{17.607}$$
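
Table 14.14


Husband   Wife   Difference, D   |D|   Rank of |D|   Signed rank
   35      25         10          10       14           14
   16      20         -4           4        7.5         -7.5
   17      18         -1           1        1           -1
   25      20          5           5       10           10
   30      25          5           5       10           10
   32      25          7           7       13           13
   28      25          3           3        5.5          5.5
   31      19         12          12       15           15
   27      30         -3           3        5.5         -5.5
   15      20         -5           5       10          -10
   19      15          4           4        7.5          7.5
   22      20          2           2        3            3
   33      27          6           6       12           12
   30      28          2           2        3            3
   18      20         -2           2        3           -3
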
If we use W-, then Z = (27 - 60)/17.607 = -1.87, and the p value = .5 - .4693 = .0307. If we use W+, then Z =
(93 - 60)/17.607 = 1.87, and the p value = .0307. It is clear that we may use either W+ or W-. In either case, we
reject the null hypothesis since the p value < .05.


14.6 The Minitab output for the solution to problem 14.5 is shown below. Answer the following 
questions by referring to this output. 
(a) Explain the various parts of the command WTest 0.0 'Diff'; 
(b) Why was the subcommand Alternative 1. used? 
(c) What is the number 93.0 shown under Wilcoxon statistic? 
(d) How does the p value given in the Minitab output compare with the p value found in 
problem 14.5? 


MTB > print cl 


Data Display 
Diff 
10 -4 -1 5 5 7 3 12 -3 -5
4 2 6 2 -2


MTB > WTest 0.0 'Diff';
SUBC> Alternative 1.


Wilcoxon Signed Rank Test 
Test of median = 0.000000 vs. median > 0.000000 


          N for   Wilcoxon          Estimated
      N    Test  Statistic      P      Median
Diff 15      15       93.0  0.032       2.750


Ans. (a) WTest indicates that the Wilcoxon signed-rank test is to be performed. The value 0.0 indicates
that we are testing the null hypothesis that the median difference is 0.0. The median difference will
be 0.0 if the null hypothesis is true. Diff indicates that we are performing the test for the
differences in column C1.

(b) The subcommand Alternative 1. indicates an upper-tailed test. Since Diff = Husband
salary - Wife salary, a positive median will support the research hypothesis that the husband
salaries are greater than the wife salaries.

(c) The Wilcoxon statistic is the same as W+ found in problem 14.5.

(d) The p values are very close. The p value based on the normal approximation is 0.0307 and the
Minitab p value is 0.032.


WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES 


14.7 A study compared the household giving for two different Protestant denominations. Table
14.15 gives the yearly household giving in hundreds of dollars for individuals in both samples.
The ranks are shown in parentheses beside the household giving. Use the normal
approximation to the Wilcoxon rank-sum test statistic to test the research hypothesis that the
household giving differs for the two denominations. Use level of significance α = .01.


Table 14.15
Denomination 1   Denomination 2
 9.3 (5)           7.0 (1)
13.5 (20)          9.0 (4)
13.7 (21)          9.7 (8)
10.2 (10)         10.4 (11)
10.0 (9)          10.9 (13)
10.7 (12)          8.0 (2)
11.3 (15)          8.5 (3)
11.4 (16)          9.5 (6.5)
11.5 (17)          9.5 (6.5)
12.3 (18)         11.1 (14)
14.2 (22)         14.5 (23)
12.5 (19)


Ans.


The sample sizes are n1 = 12 and n2 = 11. The rank sum associated with sample 1 is R1 = 184. The
test statistic is computed as follows:


$$Z^* = \frac{R_1 - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} = \frac{184 - 12(12 + 11 + 1)/2}{\sqrt{12 \times 11 \times (12 + 11 + 1)/12}} = 2.46$$


The critical values are ±2.58. Since the computed value of the test statistic does not exceed the
right side critical value, the null hypothesis is not rejected. The p value is computed as follows.
The area to the right of 2.46 is .5 - .4931 = .0069, and since the alternative is two sided, the p
value = 2 × .0069 = 0.0138. If the p value approach to testing is used, we do not reject since the p
value is not less than the preset α.
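
The ranking and the test statistic in problem 14.7 can be sketched in Python. This is an illustrative sketch, not the Minitab procedure; the helper name average_ranks is an assumption introduced here. The two samples are combined, ranked with tied values sharing the average rank, and the rank sum for the first sample is standardized with formula (14.2).

```python
from math import sqrt

def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]

# Yearly household giving (hundreds of dollars), from the Minitab data listing in problem 14.8
denom1 = [9.3, 13.5, 13.7, 10.2, 10.0, 10.7, 11.3, 11.4, 11.5, 12.3, 14.2, 12.5]
denom2 = [7.0, 9.0, 9.7, 10.4, 10.9, 8.0, 8.5, 9.5, 9.5, 11.1, 14.5]

n1, n2 = len(denom1), len(denom2)
ranks = average_ranks(denom1 + denom2)   # ranks in the combined sample
r1 = sum(ranks[:n1])                     # rank sum for sample 1
r2 = sum(ranks[n1:])                     # rank sum for sample 2

# Formula (14.2)
z_star = (r1 - n1 * (n1 + n2 + 1) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(r1, r2, round(z_star, 2))
```

The two rank sums should add to 23(24)/2 = 276, the sum of the ranks 1 through 23.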


14.8 The Minitab output for the solution to problem 14.7 is shown below. Answer the following
questions by referring to this output.
(a) Explain the command Mann-Whitney 'Denom1' 'Denom2';
(b) What does the subcommand Alternative 0. mean?
(c) What is the line W = 184 giving you?
(d) Explain the line Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significant at 0.0151.


Row  Denom1  Denom2
  1     9.3     7.0
  2    13.5     9.0
  3    13.7     9.7
  4    10.2    10.4
  5    10.0    10.9
  6    10.7     8.0
  7    11.3     8.5
  8    11.4     9.5
  9    11.5     9.5
 10    12.3    11.1
 11    14.2    14.5
 12    12.5


MTB > Mann-Whitney 'Denom1' 'Denom2';
SUBC> Alternative 0.


Mann-Whitney Confidence Interval and Test

Denom1   N = 12   Median = 11.450
Denom2   N = 11   Median = 9.500
Point estimate for ETA1-ETA2 is 2.000
95.5 Percent CI for ETA1-ETA2 is (0.499,3.501)
W = 184.0
Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significant at 0.0151
The test is significant at 0.0150 (adjusted for ties)


Ans. (a) This command requests a Mann-Whitney test for the data in columns named Denom1 and
Denom2.
(b) This subcommand indicates that the hypothesis is two-tailed. 
(c) This is the same as R,, the sum of ranks for sample 1. 
(d) This line gives the p value for the test as 0.0151. This value is close to the p value obtained in 
problem 14.7. 


KRUSKAL-WALLIS TEST 


14.9 Thirty individuals were randomly divided into 3 groups of 10 each. Each member of one group 
completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the 
questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater 
the satisfaction. A second set of CSI scores were obtained for garbage collection from the 
second group, and a third set of scores were obtained for long distance telephone service. The 
scores are given in Table 14.16. Perform a Kruskal-Wallis test to determine if the population 
distributions differ. Use significance level a = .01. 


Table 14.16 


Ans. The data in Table 14.16 are combined and ranked as one group. The resulting ranks are given in
Table 14.17. The following rank sums are obtained for the three groups: R1 = 65.0, R2 = 154.0,
and R3 = 246.0.


Table 14.17 


Garbage collection 




The computed value of the Kruskal-Wallis test statistic is found as follows: 


$$W^* = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) = \frac{12}{30 \times 31}\left[\frac{65^2}{10} + \frac{154^2}{10} + \frac{246^2}{10}\right] - 3 \times 31 = 21.138$$

The critical value is found by using the Chi-square distribution table with df = 3 - 1 = 2 to equal
9.210. The null hypothesis is rejected since the computed value of the test statistic exceeds the
critical value.


14.10 The Minitab analysis for problem 14.9 is shown below. Answer the following questions.


(a) Explain the command Kruskal-Wallis 'CSI' 'Group'.
(b) If the average ranks are multiplied by 10, what do you obtain?
(c) What does the row H = 21.14 DF = 2 P = 0.000 give you?


MTB > Kruskal-Wallis 'CSI' 'Group'. 


Kruskal-Wallis Test 
Kruskal-Wallis Test on CSI 


Group      N   Median   Ave Rank       Z
1         10    5.750        6.5   -3.96
2         10   16.000       15.4   -0.04
3         10   24.000       24.6    4.00
Overall   30                15.5
H = 21.14 DF = 2 P = 0.000
H = 21.48 DF = 2 P = 0.000 (adjusted for ties)


Ans. (a) The command asks for a Kruskal-Wallis test for the data in the column called CSI. The 
column called Group identifies the three samples. 
(b) The rank sums for the three groups: R1 = 65, R2 = 154, and R3 = 246.
(c) This row gives the computed value of the test statistic, the degrees of freedom for the Chi- 
square distribution, and the p value = 0.000. 


RANK CORRELATION 


14.11 The body mass index (BMI) for an individual is found as follows: Using your weight in
pounds and your height in inches, multiply your weight by 705, divide the result by your
height, and divide again by your height. The desirable body mass index varies between 19 and
25. Table 14.18 gives the body mass index and the age for 20 individuals. Find the Spearman
rank correlation coefficient for the data shown in the table.


Table 14.18 

Ans. Table 14.19 contains the ranks for the BMI values and the ranks for the ages given in Table 14.18.
The formula for the Pearson correlation coefficient given in formula (14.4) is applied to the ranks
of the BMI values and the ranks of the ages. Let x represent the ranks of the BMI values and y
represent the ranks of the ages.


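$$\Sigma x = 9.0 + 11.0 + 15.0 + \cdots + 17.0 + 1.0 = 210, \quad \Sigma x^2 = 81 + 121 + 225 + \cdots + 289 + 1 = 2869$$

$$\Sigma y = 8.5 + 13.0 + 18.0 + \cdots + 20 + 2.5 = 210, \quad \Sigma y^2 = 72.25 + 169 + 324 + \cdots + 400 + 6.25 = 2868$$

$$\Sigma xy = 9 \times 8.5 + 11 \times 13 + 15 \times 18 + \cdots + 17 \times 20 + 1 \times 2.5 = 2646.5$$

$$S_{xx} = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = 2869 - 2205 = 664, \quad S_{yy} = \Sigma y^2 - \frac{(\Sigma y)^2}{n} = 2868 - 2205 = 663$$

$$S_{xy} = \Sigma xy - \frac{(\Sigma x)(\Sigma y)}{n} = 2646.5 - 2205 = 441.5$$

$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{441.5}{\sqrt{664 \times 663}} = 0.665$$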


Table 14.19 


14.12 Use the results found in problem 14.11 to test the null hypothesis H0: ρs = 0 vs. Ha: ρs ≠ 0 at
level of significance α = .05.


Ans. The critical values are ±1.96. The computed value of the test statistic is found as follows:
$z^* = \sqrt{20 - 1} \times .665 = 2.90$. We conclude that there is a positive association between age
and body mass index.
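
As a check on problems 14.11 and 14.12, the following minimal sketch recomputes the Spearman coefficient and the large-sample test statistic from the sums quoted in the answer. The variable names are illustrative; only the numerical sums come from the text.

```python
import math

# Sums of the ranks as quoted in the answer to problem 14.11 (n = 20 individuals).
n = 20
sum_x, sum_x2 = 210.0, 2869.0   # ranks of the BMI values
sum_y, sum_y2 = 210.0, 2868.0   # ranks of the ages
sum_xy = 2646.5

s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Pearson formula applied to the ranks gives the Spearman coefficient.
r_s = s_xy / math.sqrt(s_xx * s_yy)

# Large-sample test statistic used in problem 14.12.
z_star = math.sqrt(n - 1) * r_s

print(round(r_s, 3), round(z_star, 2))
```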


RUNS TEST FOR RANDOMNESS 


14.13 The genders of the past 30 individuals hired by a personnel office are as follows, where M
represents a male and F represents a female.


FFFMMMMFMFMFFFFMMMMMMMFFMMMFFF


Test for randomness in hiring using level of significance α = .05.


Ans. There are n1 = 14 F's and n2 = 16 M's. There are R = 11 runs. If the hiring is random, the mean
number of runs is

$$\mu = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2 \times 14 \times 16}{30} + 1 = 15.933$$

and the standard deviation of the number of runs is

$$\sigma = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{448 \times 418}{900 \times 29}} = 2.679$$

The computed value of the test statistic is $Z^* = \frac{R - \mu}{\sigma} = \frac{11 - 15.933}{2.679} = -1.84$. The critical
values are ±1.96, and since the computed value of the test statistic is not less than -1.96, we are
unable to reject randomness in hiring with respect to gender.
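
The runs test above can be sketched in a few lines. The sequence below is the hiring sequence as coded in the Minitab column of problem 14.14 (0 = F, 1 = M), written back as letters; the function name and return layout are illustrative assumptions.

```python
import math


def runs_test(seq):
    """Large-sample runs test for a sequence containing exactly two distinct symbols."""
    symbols = sorted(set(seq))
    if len(symbols) != 2:
        raise ValueError("sequence must contain exactly two distinct symbols")
    n1 = sum(1 for s in seq if s == symbols[0])
    n2 = len(seq) - n1
    # A new run starts whenever a symbol differs from its predecessor.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    sigma = math.sqrt(
        2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    )
    return runs, mu, sigma, (runs - mu) / sigma


hires = "FFFMMMMFMFMFFFFMMMMMMMFFMMMFFF"
runs, mu, sigma, z = runs_test(hires)
print(runs, round(mu, 3), round(sigma, 3), round(z, 2))
```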


14.14 The Minitab analysis for the runs test is shown below. Compute the p value corresponding to
the value Z* = -1.84 in problem 14.13 and compare it with the p value given in the Minitab output.


C1

0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1
1 1 1 1 0 0 1 1 1 0 0 0

MTB > Runs .5 C1.

The observed number of runs = 11 

The expected number of runs = 15.9333 


16 Observations above K 14 below 
The test is significant at 0.0658 
Cannot reject at alpha = 0.05 


Ans. The area to the left of Z* = -1.84 is .5 — .4671 = .0329. Since the alternative hypothesis is two- 
tailed, the p value = 2 x .0329 = 0.0658. This is the same as the p value given in the Minitab 
output. 
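
The table lookup in this answer can be reproduced numerically. This is a minimal sketch that builds the standard normal cumulative distribution from the error function in Python's `math` module; the helper name `normal_cdf` is an illustrative choice.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


z_star = -1.84
left_tail = normal_cdf(z_star)   # area to the left of Z* = -1.84
p_two_tailed = 2 * left_tail     # two-tailed alternative doubles the tail area

print(round(left_tail, 4), round(p_two_tailed, 4))
```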


Supplementary Problems 


NONPARAMETRIC METHODS 


14.15 Three mathematical formulas for dealing with ranks are often utilized in nonparametric methods.
These formulas are used for summing powers of ranks. They are as follows:

$$1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2}$$

$$1^2 + 2^2 + 3^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$$

$$1^3 + 2^3 + 3^3 + \cdots + n^3 = \frac{n^2(n+1)^2}{4}$$

Use the above special formulas to find the sum, sum of squares, and sum of cubes for the ranks 1
through 10.


Ans. sum = 55, sum of squares = 385, and sum of cubes = 3,025.


14.16 Nonparametric methods often deal with arrangements of plus and minus signs. The number of
different possible arrangements consisting of a plus signs and b minus signs is given by
$\binom{a+b}{a}$ or $\binom{a+b}{b}$. Use this formula to determine the number of different arrangements
consisting of 2 plus signs and 3 negative signs and list the arrangements.


Ans. The number of arrangements is $\binom{5}{2} = \frac{5!}{2! \times 3!} = 10$. The arrangements are as follows:
++---, +-+--, +--+-, +---+, -++--, -+-+-, -+--+, --++-, --+-+, ---++
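
Both answers can be verified with a short sketch. It evaluates the three closed-form sums for n = 10 alongside direct summation, and enumerates the arrangements of 2 plus signs and 3 minus signs by choosing the positions of the plus signs. All names are illustrative.

```python
from itertools import combinations
from math import comb

n = 10
rank_sum = n * (n + 1) // 2
rank_sum_squares = n * (n + 1) * (2 * n + 1) // 6
rank_sum_cubes = n ** 2 * (n + 1) ** 2 // 4

# Direct summation, for comparison with the closed-form results.
direct = (
    sum(range(1, n + 1)),
    sum(r ** 2 for r in range(1, n + 1)),
    sum(r ** 3 for r in range(1, n + 1)),
)
print(rank_sum, rank_sum_squares, rank_sum_cubes, direct)

# Each arrangement is fixed by the positions of the plus signs.
plus_signs, minus_signs = 2, 3
total = plus_signs + minus_signs
arrangements = [
    "".join("+" if i in positions else "-" for i in range(total))
    for positions in combinations(range(total), plus_signs)
]
print(comb(total, plus_signs), arrangements)
```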


SIGN TEST 


14.17 The police chief of a large city claims that the median response time for all 911 calls is 15
minutes. In a random sample of 250 such calls it was found that 149 of the calls exceeded 15 minutes
in response time. All the other response times were less than 15 minutes. Do these results refute the
claim? Use level of significance .05. Assume a two-tailed alternative hypothesis.


Ans. Assuming the claim is correct, we would expect μ = np = 250 × .5 = 125 calls on the average. The
standard deviation is $\sigma = \sqrt{npq} = 7.906$. The z value corresponding to 149 calls, using the
continuity correction, is (148.5 - 125)/7.906 = 2.97. Since the computed z value exceeds 1.96, the
results refute the claim.


14.18 The claim is made that the median number of movies seen per year per person is 35. Table 14.20
gives the number of movies seen last year by a random sample of 25 individuals.


Table 14.20


Test the null hypothesis that the median is 35 vs. the alternative that the median is not 35. Use α = .05.


Ans. Table 14.21 gives a plus sign if the value in the corresponding cell in Table 14.20 exceeds 35, a 0
if the corresponding cell value equals 35, and a minus sign if the corresponding cell value is less
than 35. There are 12 plus signs, 10 minus signs, and 3 zeros.


The Minitab output below indicates a p value = 0.8318, indicating that there is no statistical
evidence to reject the claim that the median is 35.


MTB > print c1
19 39 45 35 66 44 12 29 70 44 37 
31 23 50 52 29 35 20 52 26 35 41 
25 88 26 


MTB > STest 35 'movies'; 
SUBC> Alternative 0. 


Sign Test for Median 
Sign test of median = 35.00 vs. not = 35.00 
N Below Equal Above P Median 


Movies 25 10 3 12 0.8318 35.00 
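
The two sign-test answers above can be reproduced with a minimal sketch. It assumes the normal approximation with a continuity correction for problem 14.17 and an exact two-tailed binomial p value (zeros discarded) for problem 14.18; the function names are illustrative.

```python
from math import comb, sqrt


def sign_test_z(count, n):
    """Normal approximation to the sign test, with a continuity correction of 0.5."""
    mu = n * 0.5
    sigma = sqrt(n * 0.25)
    if count > mu:
        adjusted = count - 0.5
    elif count < mu:
        adjusted = count + 0.5
    else:
        adjusted = count
    return (adjusted - mu) / sigma


def sign_test_exact_p(plus, minus):
    """Exact two-tailed p value from the binomial distribution with p = .5."""
    n = plus + minus                      # zeros are not counted
    k = min(plus, minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


z_calls = sign_test_z(149, 250)           # problem 14.17
p_movies = sign_test_exact_p(12, 10)      # problem 14.18
print(round(z_calls, 2), round(p_movies, 4))
```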

WILCOXON SIGNED-RANK TEST FOR TWO DEPENDENT SAMPLES


14.19 Use the normal approximation to the Wilcoxon signed-rank test statistic to solve problem 14.18.
Note that this application of the Wilcoxon signed-rank test does not involve dependent samples. We are
testing the median value of a population using a single sample.


Ans. Table 14.22 shows the computation of the signed ranks. Note that 0 differences are not ranked and
are omitted from the analysis. The sum of positive ranks is W+ = 150.5, and the absolute value of
the sum of the negative ranks is W- = 102.5. The test statistic used in the normal approximation
is:

$$Z^* = \frac{W^+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} = \frac{150.5 - 22 \times 23/4}{\sqrt{22 \times 23 \times 45/24}} = 0.779$$

Since the computed value of the test statistic does not exceed 1.96, the null hypothesis is not
rejected.
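
A minimal sketch of the normal approximation used here, taking W+ = 150.5 and n = 22 nonzero differences from the answer (the function name is illustrative):

```python
import math


def signed_rank_z(w_plus, n):
    """Normal approximation to the Wilcoxon signed-rank statistic W+ with n nonzero differences."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


z_star = signed_rank_z(150.5, 22)
print(round(z_star, 3))
```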


Table 14.22 


D = X - 35 | |D| | Rank of |D| | Signed rank


16.0 
2.5
12.0 

Not used 
20.0 
8.5 
19.0 
5.0 
21.0 
8.5 
1.0 
2.5
14.0 
15.0 
17.5 
5.0 

Not used 
12.0 
17.5 
8.5 

Not used 
5.0 
12.0 
22.0
8.5 


14.20 The Minitab output for problem 14.19 is shown below. Answer the following questions.
(a) Explain the command WTest 35 'movies';
(b) Explain the subcommand Alternative 0.
(c) Explain the output.
(d) Compute the p value corresponding to the value Z* = 0.779 in problem 14.19 and compare it with
the p value given in the Minitab output.


MTB > print c1


19 39 45 35 66 44 12 29 70 44 37
31 23 50 52 29 35 20 52 26 35 41
25 88 26

MTB > WTest 35 'movies';
SUBC> Alternative 0. 


Wilcoxon Signed Rank Test 
Test of median = 35.00 vs. median not = 35.00 


N for Wilcoxon Estimated 
N Test Statistic P Median 
Movies 25 22 150.5 0.445 37.50 


Ans. (a) WTest indicates a Wilcoxon signed-rank test. The number 35 is the median value to be tested.
Movies indicates the column containing the sample data.
(b) The subcommand indicates that the research hypothesis is two-tailed.
(c) The output indicates that the sample size is 25. However, only 22 of the values were used
since there were 3 differences that were 0. The value for the Wilcoxon Statistic is the same as
W+ in problem 14.19. The p value is 0.445.
(d) P value = 2 × (.5 - .2823) = 0.4354, which is close to the p value of 0.445 given in the
Minitab output.


WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES 


14.21 The leading cause of posttraumatic stress disorder (PTSD) among men is wartime combat and among 
women the leading causes are rape and sexual molestation. The duration of symptoms was measured for 
a group of men and a group of women. The times of duration in years for both groups are given in 
Table 14.23. Use the Wilcoxon rank-sum test to test the research hypothesis that the median times of 
duration differ for men and women. Use level of significance α = .01. Use the normal approximation
procedure to perform the test.


Table 14.23 


Men Women 


Ans. Table 14.24 gives the rankings when the two samples are combined and ranked together. 


Table 14.24 


R1 = 87.5     R2 = 122.5

The sample sizes are n1 = 10 and n2 = 10. The rank sum associated with sample 1 is R1 = 87.5. The
test statistic is computed as follows:

$$Z = \frac{R_1 - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} = \frac{87.5 - 10(10 + 10 + 1)/2}{\sqrt{10 \times 10 \times (10 + 10 + 1)/12}} = -1.32$$

The critical values are ±2.58. Since the computed value of the test statistic is not less than the
left-side critical value, the null hypothesis is not rejected. The p value is computed as follows. The
area to the left of -1.32 is .5 - .4066 = .0934, and since the alternative is two sided, the p value =
2 × .0934 = 0.1868. If the p value approach to testing is used, we do not reject the null since the p
value is not less than the preset α.
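
The rank sum and test statistic can be recomputed from the raw durations. The sketch below uses the Men and Women columns as printed in the Minitab listing for problem 14.22 and assigns average ranks to ties; the helper `midranks` is an illustrative name, not a library function.

```python
import math


def midranks(values):
    """Ranks of the values, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1          # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


men = [2.5, 3.3, 3.5, 4.0, 5.0, 5.0, 6.5, 6.5, 10.0, 15.0]
women = [3.0, 4.0, 4.0, 6.5, 7.0, 8.5, 10.0, 12.5, 13.0, 17.5]

ranks = midranks(men + women)
n1, n2 = len(men), len(women)
r1 = sum(ranks[:n1])                        # rank sum for the men
r2 = sum(ranks[n1:])                        # rank sum for the women

z = (r1 - n1 * (n1 + n2 + 1) / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(r1, r2, round(z, 2))
```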


14.22 The Minitab solution for problem 14.21 is shown below. Answer the following questions.
(a) Explain the command line Mann-Whitney 95.0 'Men' 'Women'; 
(b) Explain the subcommand Alternative 0. 
(c) Discuss the output. 


MTB > print c1 c2


Row Men Women 
1 2.5 3.0 
2 3.3 4.0 
3 3.5 4.0 
4 4.0 6.5 
5 5.0 7.0 
6 5.0 8.5 
7 6.5 10.0 
8 6.5 12.5 
9 10.0 13.0 
10 15.0 17.5 


MTB > Mann-Whitney 95.0 'Men' 'Women' ; 
SUBC> Alternative 0. 


Mann-Whitney Confidence Interval and Test 

Men N = 10 Median = 5.000 

Women N = 10 Median = 7.750 

Point estimate for ETA1-ETA2 is -2.250 

95.5 Percent CI for ETA1-ETA2 is (-6.503,1.002) 
W= 87.5 


Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significant at 0.1988 


The test is significant at 0.1971 (adjusted for ties) 


Cannot reject at alpha = 0.05 


Ans. (a) The command line requests a Mann-Whitney analysis using the data in the columns labeled
Men and Women. The value 95.0 requests a 95% confidence interval on the difference in the
medians for the two populations.
(b) The subcommand indicates that the test is two-tailed.
(c) The output gives a 95% confidence interval on the difference in the two population medians.
The p value is given to be 0.1988. We are unable to reject the null hypothesis.


KRUSKAL-WALLIS TEST 


14.23 The cost of a meal, including drink, tax, and tip, was determined for ten randomly selected
individuals in New York City, Chicago, Boston, and San Francisco. The results are shown in Table
14.25. Test the null hypothesis that the four population distributions of costs are the same for the
four cities vs. the hypothesis that the distributions are different. Use level of significance α = .05.


Table 14.25


Ans. Table 14.26 gives the results when the four samples are combined and ranked together. 


Table 14.26 


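$$W = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) = \frac{12}{40 \times 41}\left(\frac{259.5^2}{10} + \frac{164.5^2}{10} + \frac{163^2}{10} + \frac{233^2}{10}\right) - 3 \times 41 = 5.24$$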


The critical value is 7.815. We cannot conclude that the distributions are different. 


14.24 The Minitab output for the Kruskal-Wallis test in problem 14.23 is shown below. 
(a) Explain the command line. 
(b) Explain the output.


MTB > Kruskal-Wallis 'cost' 'city'.


Kruskal-Wallis Test 
Kruskal-Wallis Test on cost 


City N Median Ave Rank Z 
1 10 27.75 26.0 1.70 
2 10 22.88 16.5 -1.27 
3 10 21.25 16.3 -1.31
4 10 27.00 23.3 0.87
Overall 40 20.5
H = 5.24 DF = 3 P = 0.155
H= 5.24 DF = 3 P = 0.155 (adjusted for ties) 


Ans. (a) The command line Kruskal-Wallis 'cost' 'city'. requests a Kruskal-Wallis test
for the responses in the column called cost and the cities identified in the column called city.
(b) The output lists the 4 cities, the median cost for each city, and the mean rank for each city. In 
addition, the computed Kruskal-Wallis test statistic is H = 5.24. The p value is 0.155. 

RANK CORRELATION


14.25 Table 14.27 gives the percent of calories from fat and the micrograms of lead per deciliter of blood for a 
sample of preschoolers. Find the Spearman rank correlation coefficient for data shown in the table. 


Table 14.27 
Percent Fat Calories 


Ans. The Spearman correlation coefficient = 0.919. 


14.26 Test for a positive correlation between the percent of calories from fat and the level of lead in the blood 
using the results in problem 14.25. 


Ans. The computed test statistic is 3.05 and the α = .05 critical value is 1.65. We conclude that a
positive correlation exists.


RUNS TEST FOR RANDOMNESS 


14.27 The first 100 decimal places of π contain 51 even digits (0, 2, 4, 6, 8) and 49 odd digits. The
number of runs is 43. Are the occurrences of even and odd digits random?


Ans. Assuming randomness, μ = 50.98, σ = 4.9727, and z* = -1.60. Critical values = ±1.96. Since z*
lies between the critical values, randomness is not rejected.


14.28 The weights in grams of the last 30 containers of black pepper selected from a filling machine are 
shown in Table 14.28. The weights are listed in order by rows. Are the weights varying randomly about 
the mean? 


Ans. The mean of the 30 observations is 28.06. Table 14.29 shows an A if the observation is above
28.06 and a B if the observation is below 28.06.


The sequence AABBBBAAAABBAAABAAAAABBBBBBAAA has 17 A's, 13 B's, and 9 runs.
If the weights are varying randomly about the mean, the mean number of runs is 15.7333 and the
standard deviation is 2.6414. The computed value of z is z* = -2.55. The critical values for α =
.05 are ±1.96. Randomness about the mean is rejected.

P

.05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95

9500 .9000 .8000 .7000 6000 5000 4000 3000 2000 1000 .0500 
0500 .1000 .2000 3000 4000 5000 6000 7000 8000 .9000 § .9500 
9025 8100 .6400 .4900 3600 .2500 .1600 .0900 0400 0100 .0025 
0950 =.1800 §=.3200 §=6—.4200 §==.4800 §==5000 86.4800) §=6=.4200)§=—=.3200 )=—.1800 )=—.0950 
0025 0100 0400 0900 .1600 .2500 .3600 .4900 6400 8100 .9025 
8574 .7290) =.5120) )=.3430) 2160 3.1250 )3=.0640))3=—.0270 3S 0080 Ss 0010~—s—«.0001 
1354 .2430 3840) — 4410) = 4320S .3750)— 2880S 1890 )=— .0960)3=s.0270~—s—.0071 
0071 =.0270 =.0960 .1890 .2880 3750 §=6.4320 =6.4410 =—.3840) = 2430-1354 
000! 0010 0080 0270 0640 1250 2160 .3430 5120 .7290 8574 
8145 .6561 4096 .2401 .1296 .0625 .0256 .0081 .0016 .0001 .0000 
1715 2916 =.4096 = 4116 = 3456) 2500) .1536) = 0756) = .0256 )3=—.0036 = .0005 
0135 0486 1536 8.2646 §«.3456) «6.3750 = 3456S 2646) 1536) 0486 = 0135 
0005 .0036 «=.0256 )=.0756— 1536) 2500 = 3456S .4116 = .4096)— 29161715 
0000 .0001 0016 0081 0256 0625 1296 .2401 .4096 6561 8145 
7738 =.5905)- 3277) «1681 = 0778 = 0312, 0102S 0024 = .0003.— «0000 = .0000 
2036 §=.3280 §=6©.4096)—s 3602.) 2592S 1562) 0768 )3=— 0284 = 0064S .0005~=—s_ «0000 
0214 =.0729 = .2048) = 3087) Ss 3456) = .3125) 2304) .1323) 0512S 0081 ~=—s «0011 
OTL .0081 0512) = 1323) 2304) .3125) 3456) 3087) 2048 )=— 0729S 0214 
0000 .0004 0064 0283 0768 .1562 2592 .3601 .4096 3281 .2036 
0000 §=.0000 §=.0003) =.0024)=.0102) 0312) .0778 — 1681 = 3277 Ss 5905 .7738 
7351 5314 = 2621) = £1176) 0467) 0156) = 0041 = 0007) (OOO =. 0000~—s— «0000 
2321 = 3543) 3932) 3025. Ss .1866 = 0937. Ss 60369) Ss 0102s «.0015 =.000!1 = .0000 
0305 .0984 «=.2458) = .3241) 3110) 23441382) 0595) 0154) = .0012_~—s 0001 
0021 = .0146 =.0819 = =.1852) 2765) 3125) 27651852) 0819S 0146 — 0021 
0001 0012 0154 0595 .1382 §=.2344 3110) =—.3241) = 62458) = 0984 = 0305 
0000 = .0001 .0015 0102 .0369 0937 .1866 .3025 3932 3543 = .2321 
0000 =.0000 =.0001 0007 0041 0156 0467) 1176) .2621 5314 = 735] 
6983 .4783 «92097 = 0824.) Ss 0280 Ss 0078 )=— 0016 = .0002.—s- .0000)=—.0000_~=—s_ 0000 
2573 = 3720) 3670) .2471 £1306) = 0547) 0172s 0036 »=—. .0004.—s 0000 = .0000 
0406 = .1240) .2753)) «3177'S 2613). 1641) = 0774) 0250 = 0043S «0002 = .0000 
0036 §=.0230) =«.1147)— 2269) 2903) 2734S .1935. 0972, Ss 0287) = 0026 = .0002 
0002. © .0026) =.0287) Ss .0972)— 1935) .2734)— 2903) .2269— 1147) 0230 = 0036 
0000 .0002 0043 0250 0774 .1641 .2613 3177) = 2753 £1240) .0406 
0000 .0000 .0004 0036 0172 0547 1306 8.2471 3670) =.3720-~—.2573 
0000 =.0000 §=.0000 §=.0002) «0016 §=6.0078)§=.0280)3=—.0824)—s 2097) Ss 4783 6983 
6634 4305 .1678 0576 .0168 0039 .0007 0001 0000 0000 0000 
2793 38260) 3355) .1977)S 60896) 0312, 0079's 0012. .0001 ~=.0000~=—- .0000 
0515 1488) = .2936))— 2965) 2090S 1094. 0413) .0100 )3=.001!) =—.0000)~=—.0000 
0054 =—.0331) = 1468) .2541) 2787) 2187) 1239'S 0467) =—s 0092S 0004 = .0000 
0004 .0046 0459 =.1361 = .2322,) 2734) 2322) 1361 = 0459 = 0046 = .0004 

P
n x .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95
5 .0000 .0004 0092 0467) 1239 2187 2787 §=©.2541 1468 0331 .0054 
6 .0000 .0000 O01! O100 .0413 .1094 2090 .2965 2936 .1488 .OSI5 
7 0000 .0000 0001 0012 0079 0312 .0896 1977 .3355 .3826 .2793 
8 .0000 .0000 .0000 0001 .0007) = .0039 0168 = .0576 =.1678 4305 6634 
9 O 6302 .3874 .1342 0404 0101 .0020 0003 .0000 ©0000 §=.0000~ = .0000 
1 .2985 .3874 .3020 .1556 .0605 .0176 .0035 .0004 .0000 0000 = .0000 
2 .0629 .1722 = .3020) §=©.2668)=— 1612) 0703S 0212. .0039 = .0003.—s_ .0000~=—— .0000 
3.0077 .0446 = .1762) =.2668 = .2508_~—«1641) = 0743) 0210 »=.0028 )=— 0001 ~=—.0000 
4 .0006 .0074 .0661 1715) .2508 §=.2461 =.1672 0735 .0165 .0008 .0000 
5 .0000 .0008 0165 0735 .1672 2461 2508 .1715 .0661 .0074 .0006 
6 .0000 .0001 0028 0210 0743 .1641 2508 .2668 .1762 .0446 .0077 
7 .0000 «6.0000 )§=©.0003) —.0039-Ss 0212) 0703) «1612. 2668 )=—.3020—s 1722S .0629 
8 .0000 .0000 0000 .0004 0035 0176 0605 1556 .3020 3874 § .2985 
9 .0000 .0000 6.0000 6.0000 )§«6©=.0003) = =—.0020)— 0101 )=— 04041342) 3874 ~——.6302 
10 0  .5987  .3487 = .1074 =.0282. Ss .0060 »3=—.0010 = .000!_ 3=.0000)3=—.0000)3=—.0000-~=—s— 0000 
1 3151 .3874 =.2684 = .1211 = 0403) 0098 )=—s 0016 = .0001_ = 0000 )=3=—.0000-~—s_ 0000 
2 .0746 .1937 .3020) =©.2335) 1209S 0439S 0106 = 0014 —s 0001 =3=—.0000 = .0000 
3 0105 0574 2013) 2668) =.2150 = 1172, 0425s 0090 »=— 0008 )3=— 0000 = .0000 
4 0010 0112 0881 200! .2508 .2051 1EtS) = .0368 )»=.0055) Ss 0001 = .0000 
5 .0001 .0OIS 0264 .1029 2007) .2461 2007) 1029 0264 0015  .0001 
6 .0000 .000! 0055 .0368 1115 2051 = =.2508 =.2001 8.0881 0112 0010 
7 .0000 = .0000) =©.0008 =.0090) = .0425) 1172) 2150) .2668 = .2013) 0574 ~—s 0105 
8 .0000 0000 OOO! 0014 0106 0439 1209 2335 .3020 .1937 .0746 
9 .0000 .0000 §«€©=.0000)«36.0001)«=—.0016— 0098) 0403-1211) 268438743151 
10 6.0000) )=6.0000.)3=—.0000)=—.0000)3=.000l_=S 0010) = 0060) .0282)— «1074 3487) = 5987 
1! O .5688 .3138 .0859 0198 .0036 0005 .0000 .0000 0000 0000 8.0000 
1 .3293 3835) = .2362)=—_ 0932s «0266 )3=—. .0054.—s .0007)=—s- .0000)=— .0000 = .0000-~=— .0000 
2 .0867) 2131) = =.2953) 1998 )=—.0887) Ss .0269 = 0052. 0005. = .0000)=— .0000~=—s 0000 
3. 0137) .O7f0 .2215 .2568 .1774 .0806 .0234 .0037 .0002 0000 8.0000 
4 0014 0158 .1107) 2201) = .2365 1611 070! 0173 0017 0000 = .0000 
5 .0001 .0025 .0388 1321 2207 .2256 .1471 .0566 .0097 .0003 = .0000 
6 .0000 0003 0097 0566 1471 = .2256 2207) =.1321 §=.0388 )=—.0025—s 0001 
7 0000 §=6©.0000) «©.0017) 0173) 0701_~— 1611) 23652201 1107. 0158 ~=—.0014 
8 .0000 0000 0002 .0037 0234 0806) .1774 2568 .2215 0710  .0137 
9 0000 6.0000) §«6©«.0000)§=6.0005) 0052) 0269 = 0887) «1998 )=— 2953) 2131 = .0867 
10 .0000)§=©«.0000 )§=3=.0000)=—=.0000) = 0007) 0054 0266)=— 0932) 2362 = .3835. 3293 
11 .0000 §=6©=.0000)§=6.0000 3=—.0000)3=.0000 = 0005) 0036) 0198 )=— 0859) Ss 3138) = 5688 
12 5404 .2824 0687 0138 0022 0002 0000 0000 0000 0000 = =6.0000 


0988 .2301) = 2835S .1678 )3=.0639' 0161 =—.0025.- .0002.)=—s 0000 =—.0000-~=—s 0000 
01735 0852) 2362) 2397) 1419) 0537) 0125. 0015S 0001 ~=—.0000)~=— 0000 
0021 = =.0213) .1329) 2311) 2128 )=—.1208 )3=— 0420S 0078 = 0005s .0000~—s .0000 
; , ‘ ; 1934 1009 .0291 .0033 .0000 .0000 
0000 .000S 0155 0792 .1766) 2256 1766 0792 0155 .0005 .0000 
0000 .0000 0033 0291 .1009 1934 .2270 .1585 .0532 .0038 .0002 
0000 =.0000) =.0005) = .0078)— 0420) 1208) 2128) 2311) = 1329) 02130021 
0000 §=6—.0000 )3=— 0001 = 0015) 0125. 0537) 1419) 2397) 2362) 08520173 

P


13 0 5133 .2542 .0550 0097 .0013 .0001 .0000 .0000 0000 6.0000 =.0000 
1 .3512  .3672 .1787 =.0540) =.0113 =.0016 §=.000i §=.0000 §=6—.0000 )§=6—.0000 = .0000 
2 .1109 2448) =.2680)— 1388 )=— 0453) 0095. 0012, 0001 =.0000 )§=—.0000 = .0000 
3 0214 0997) 2457) 2181 = =.1107 =.0349 =—.0065 = .0006)3=— 0000 )3=—.0000 = .0000 
4 0028 0277 1535 .2337 .1845 .0873 = .0243 »=.0034 = 0001 =.0000 = .0000 
5 .0003 .0055 .0691 .1803) .2214 .1571 0656 0142 0011 .0000 .0000 
6 .0000 .0008 0230 .1030 .1968 2095 .1312 0442 0058 0001 .0000 
7 0000 .0001 0058 .0442 1312 .2095 .1968 .1030 .0230 .0008  .0000 
8 .0000 0000 0011 0142 0656 1571 .2214 .1803 .0691 0055 .0003 
9 .0000 0000 0001 .0034 0243 0873) .1845 = 2337) = .1535 = .0277 = .0028 
10 .0000 .0000 0000 0006 0065 0349 1107) 2181 8.2457) §=.0997 = 0214 
11 .6000 0000 0000 0001 0012 0095 0453 1388 2680 2448 1109 
12.0000 §=6.0000 §=.0000 §«6©=.0000)§=.0001)3=—.0016 = 0113) 0540) 1787) 3672 == 3512 
13, .0000 §=6=—.0000 )§=.0000)§3=—=.0000 )3=—.0000 )3=—.000!_)=S— 0013) 0097) 05502542 5133 
14 0 4877 .2288 .0440 .0068 0008 0001 .0000 0000 0000 0000 8.0000 
1 .3593 3559 = .1539 0407) Ss .0073) Ss .0009 Ss 0001 »3=.0000)=—.0000)3=—.0000-~=—s 0000 
2 1229) 2570) 2501-1134) 0317) 0056 )=— .0005.—- «.0000)=.0000)=3=—.0000)=—s_ .0000 
3 0259 1142) 2501) .1943) 0845) .0222) Ss 0033'S 0002 Ss 0000 »=—.0000 = .0000 
4 .0037 0349 1720) .2290) .1549 0611 .0136 0014 .0000 0000 0000 
5 .0004 .0078 0860 1963 .2066 1222 6408 .0066 .0003 .0000 0000 
6 .0000 0013 .0322 1262 .2066 1833 .0918 .0232 .0020 .0000 .0000 
7 0000 = .0002) 0092 «4.0618 1574 2095 .1574 .0618 .0092 .0002 .0000 
8 .0000 .0000 0020 0232 0918 1833 .2066 1262 .0322 .0013 .0000 
9 0000 0000 0003 .0066 0408 1222 .2066 .1963 .0860 .0078 0004 
10 .0000 §6©.0000 §=.0000 «6.0014 =.0136) Ss .0611) = £1549) 2290) 1720) 0349's 0037 
1} .0000 0000 0000 .0002 0033 0222 .0845 .1943  .2501 1142 = .0259 
12.0000 §=6=.0000 §=6=.0000 )§=6§=.0000 )§=.0005) = 0056S 0317) 1134) .2501) =) .2570 1229 
13. 0000 §=—.0000 )§=.0000 )§=3=.0000)§=3=.000!) )=—.0009 Ss 0073S 0407_~—s««1539 3559 = .3593 
14.0000 §=©=.0000 )§=3=.0000 §=.0000 §=3=.0000 )§3=—=.000!1_)=—.0008)3=— «0068 = 0440)=— 2288 )=— 4877 
15 4633. .2059 =.0352.-—s- .0047. Ss 0005. .0000»=—s «.0000)3=— 0000) = 0000) = .0000-=—s .0000 

P
n x .05 .10 .20 .30 .40 .50 .60 .70 .80 .90 .95
16 0 .4401 1853 0281 =.0033. =.0003) =.0000)§=—.0000 )3=—=.0000 )§3=—=.0000)§=—=.0000-~=— 0000 
1 3706 8.3294 = .1126 §=.0228 §=.0030 §=—.0002 »=—=.0000 §=—.0000 §=.0000)§=6—.0000)~=— 0000 
2 .1463 2745) .2111 .0732 0150 0018 0001 0000 0000 .0000 8.0000 
3 0359 1423) 2463 =.1465 = 0468) )=— 0085) (0008 )=—.0000 )3=—.0000 = 0000-~—s— 0000 
4 .0061 .0514 2001 .2040 .1014 0278 0040 .0002 .0000 0000 = .0000 
§ .0008 .0137. .1201 .2099 .1623 .0667 0142 0013 .0000 0000 = §8.0000 
6 .0001 .0028 0550 .1649 .1983 .1222 .0392 .0056 .0002 .0000 .0000 
7 OOOO .0004 0197. .1010 .1889 .1746 .0840 0185 .0012 .0000 .0000 
8 .0000 .0001 .0055 0487 1417) .1964 .1417 .0487 0055 .0001 .0000 
9 .0000 0000 0012 0185 0840 .1746 1889 1010 0197 .0004 .0000 
10 .0000 ©0000 0002 0056 .0392 1222 =.1983 = .1649 §=.0550 §=.0028 ~=—.0001 
1t .0000) §6©«.0000)=6.0000-)S 0013) 0142) = 0666 )=— 1623) 2099 .1201) = 0137 = .0008 
i2 0000 §=6©—.0000 363.0000) 3=—.0002._—s 0040) 0278S 1014S 2040) = .2001_)3=— 0514 = 0061 
13 .0000 §=6©.0000 =.0000 0000 .0008 0085 0468 1465 2463 1423 0359 
14 .0000 .0000 .0000 .0000 0001 0018 0150 0732 2111 .2745 = .1463 
15 .0000 .0000 0000 =©.0000 6.0000) §«=©=.0002) 0030) 0228 )=—-1126 3=—.3294 Ss 3706 
16 .0000 .0000 §6©.0000 §=6§—=.0000 )§=63=.0000)§=3=.0000)=—.0003— 0033S «0281 1853 4401 
17 O 4181 1668 .0225  .0023 .0002 .0000 0000 0000 0000 8.0000 §=.0000 
1 .3741 .3150 .0957 .0169 .0019 .0001 .0000 .0000 .0000 .0000 8 .0000 
2 1875 .2800 .1914 0581 0102 OO10 0001 0000 .0000 8.0000 = .0000, 
3 0415 .1556 2393 .1245 0341 .0052 .0004 0000 .0000 0000 = .0000 
4 .0076 .0605 .2093 .1868 .0796 .0182 .0021 0001 0000 0000 = .0000 
§ 0010 0175 .1361 = .2081 1379 =.0472 = 0081 »=.0006 )3=— 0000 =—s «.0000~=—s_ 0000 
6 0001 .0039 .0680 .1794 .1839 0944 .0242 .0026 0001 .0000 .0000 
7 0000 .0007 .0267 1201) .1927 .1484 .0571 0095 0004 0000 .0000 
8 .0000 0001 0084 0644 1606 .1855 .1070 0276 .0021 .0000 .0000 
9 .0000 .0000 .002! .0276 .1070 1855 .1606 .0644 0084 0001 .0000 
10 .0000 §.0000 §6©.0004) «=—.0095)0571_~— 1484) .1927. = .1201 =.0267 = 0007: = .0000 
11 .0000 §=©.0000)=.0001)_ = 0026S 0242) 094418391784 = .0680)3=—.0039_~—s—«.0001 
12.0000 .0000 0000 0006) .0081 0472 = 1379 = .2081) = 1361 = 0175S 0010 
13 .0000 §=.0000 §.0000 .000!) = 0021) «63.0182 §=.0796)=— £1868 )3=— 2093) 0605 = 0076 
14 .0000 .0000 0000 .0000 «6.0004 ) «6.0052. 0341-1245) .2393) 1556 )—.0415 
15 .0000 .0000 §6©=.0000)§«§=.0000)«3=3—.0001_ = 0010 = .0102)— 0581 = 1914 2800) .1575 
16 .0000 .0000 0000 6.0000 «6.0000 )§«=3=.000!1)=.0019. 0169) 0957) «31503741 
17 .0000 §=6©.0000 36.0000 )3=.0000)3=—0000-)=S— 0000 0002) 0023) 0225.—s«i1668=—s 4 181 
18 3972, 1501 =.0180) 36.0016 )3=—.0001_~3=—s 0000 )3=.0000)3=— 0000 = 0000 = .0000)=—— 0000 
19 0 .3774 .1351 .0144 .0011 .0001 .0000 .0000 .0000 .0000 .0000 .0000
1 .3774 2852 .0685 .0093 .0008 .0000 0000 .0000 .0000 8.0000 = =.0000 
2  .1787) 2852, .1540) = 0358 )=—.0046)3=— 0003S «.0000)=— 0000) 0000s .0000~—— 0000 
3. 0533) 1796) 2182) 0869) 0175. «0018 = .0001_ = 0000. = 0000 = .0000 = .0000 
4 0112 0798 2182 .1491 .0467 .0074 0005 0000 §=©.0000 §=6.0000 = .0000 
5 0018 0266 .1636 .1916 .0933 0222 .0024 .000!1 .0000 = =6.0000 = .0000 
6 .0002 0069 0955 .1916 .1451 0518 .0085 0005 .0000 .0000 = .0000 
7 0000 0014 0443) 1525) 1797, 0961) 0237) 60022. .0000)=—.0000~=—s .0000 
8 .0000 0002 0166 0981 1797 .1442 0532 0077 = .0003 = .0000 =.0000 
9 .0000 .0000 0051 0514 1464 .1762 .0976 0220 .0013 .0000 = .0000 
10 .0000 §=.0000) §=.0013)) 0220) §=.0976 =.1762)— 1464S 0514) .0051 )=—.0000 = .0000 
11 .0000 §=©.0000)§=©.0003) «0077, 0532s 1442) 1797) 60981 )=— 0166 = .0002 = .0000 
12.0000) =.0000)§=—=.0000)=—.0022)—Ss 0237 ~—s 0961 1797) 1525) 0443) 0014 = .0000 
13.0000) =.0000 »=.0000))=— 0005. 0085) 0518 «1451 1916 .0955 = .0069 = .0002 
14 0000) §=©=.0000. =—.0000)=—.0001)_—S_ 60024. 0222) 0933) «1916S 1636 = 0266 ~— 0018 
15 0000 §=.0000) =©.0000) §=.0000 )§=.0005) 0074S 0467) 1491) 2182) 0798 0112 
16 .0000 .0000 6.0000) =©.0000) )=.0001 = 0018) = 0175) 0869) .2182))— 1796 0533 
17.0000 §=.0000 §=.0000 §=.0000 §=.0000) «=.0003)— 0046S 0358) =. 1540) .2852 1787 
18 .0000 .0000 §6©=.0000)§=63=.0000)3=—=.0000)3=—.0000)=— 0008 )=— 00938 0685) .2852.)—— 3774 
19 .9000 §©6©.0000 )§=.0000 )3=—=.0000 )3=.0000)3=—.0000 = 0001_=S 00] 0144)— 1351) 3774 

20 3585) 1216) =. O11S) = .0008)»=— 0000 )3=—.0000 = .0000)=—.0000)=— 0000 = .0000-~—— 0000 
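
The entries in the table above appear to be binomial probabilities P(x) for n trials and success probability p, so any entry that is hard to read can be recomputed. The following is a minimal sketch under that assumption; the function names and the column list `P_VALUES` are illustrative.

```python
from math import comb

# Column headings of the table: the success probabilities p.
P_VALUES = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]


def binomial_pmf(n, x, p):
    """Binomial probability of x successes in n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def table_row(n, x):
    """One row of the table: P(x) for each tabulated p, rounded to four decimals."""
    return [round(binomial_pmf(n, x, p), 4) for p in P_VALUES]


for x in range(3):
    print(2, x, table_row(2, x))
```

For example, `table_row(19, 0)` regenerates the first row for n = 19.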

Appendix 3


The entries in the table are the critical values of t for the specified number of degrees of freedom and 
areas in the right tail. 


Areas in the Right Tail under the t Distribution Curve 
df      .005      .001

 1    63.657   318.309
 2     9.925    22.327
 3     5.841    10.215
 4     4.604     7.173
 5     4.032     5.893

 6     3.707     5.208
 7     3.499     4.785
 8     3.355     4.501
 9     3.250     4.297
10     3.169     4.144

11     3.106     4.025
12     3.055     3.930
13     3.012     3.852
14     2.977     3.787
15     2.947     3.733

16     2.921     3.686
17     2.898     3.646
18     2.878     3.610
19     2.861     3.579
20     2.845     3.552

21     2.831     3.527
22     2.819     3.505
23     2.807     3.485
24     2.797     3.467
25     2.787     3.450

26     2.779     3.435
27     2.771     3.421
28     2.763     3.408
29     2.756     3.396
30     2.750     3.385


Appendix 4 


The entries in the table are the critical values of χ² for the specified degrees of freedom and areas
in the right tail.


Area in the Right Tail under the Chi-square Distribution Curve 
df    .990    .975    .950    .900    .100    .050    .025    .010    .005

 1   0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
 2   0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210  10.597
 3   0.115   0.216   0.352   0.584   6.251   7.815   9.348  11.345  12.838
 4   0.297   0.484   0.711   1.064   7.779   9.488  11.143  13.277  14.860
 5   0.554   0.831   1.145   1.610   9.236  11.070  12.833  15.086  16.750

 6   0.872   1.237   1.635   2.204  10.645  12.592  14.449  16.812  18.548
 7   1.239   1.690   2.167   2.833  12.017  14.067  16.013  18.475  20.278
 8   1.646   2.180   2.733   3.490  13.362  15.507  17.535  20.090  21.955
 9   2.088   2.700   3.325   4.168  14.684  16.919  19.023  21.666  23.589
10   2.558   3.247   3.940   4.865  15.987  18.307  20.483  23.209  25.188

11   3.053   3.816   4.575   5.578  17.275  19.675  21.920  24.725  26.757
12   3.571   4.404   5.226   6.304  18.549  21.026  23.337  26.217  28.300
13   4.107   5.009   5.892   7.042  19.812  22.362  24.736  27.688  29.819
14   4.660   5.629   6.571   7.790  21.064  23.685  26.119  29.141  31.319
15   5.229   6.262   7.261   8.547  22.307  24.996  27.488  30.578  32.801

16   5.812   6.908   7.962   9.312  23.542  26.296  28.845  32.000  34.267
17   6.408   7.564   8.672  10.085  24.769  27.587  30.191  33.409  35.718
18   7.015   8.231   9.390  10.865  25.989  28.869  31.526  34.805  37.156
19   7.633   8.907  10.117  11.651  27.204  30.144  32.852  36.191  38.582
20   8.260   9.591  10.851  12.443  28.412  31.410  34.170  37.566  39.997

21   8.897  10.283  11.591  13.240  29.615  32.671  35.479  38.932  41.401
22   9.542  10.982  12.338  14.041  30.813  33.924  36.781  40.289  42.796
23  10.196  11.689  13.091  14.848  32.007  35.172  38.076  41.638  44.181
24  10.856  12.401  13.848  15.659  33.196  36.415  39.364  42.980  45.559
25  11.524  13.120  14.611  16.473  34.382  37.652  40.646  44.314  46.928

26  12.198  13.844  15.379  17.292  35.563  38.885  41.923  45.642  48.290
27  12.879  14.573  16.151  18.114  36.741  40.113  43.195  46.963  49.645
28  13.565  15.308  16.928  18.939  37.916  41.337  44.461  48.278  50.993
29  14.256  16.047  17.708  19.768  39.087  42.557  45.722  49.588  52.336
30  14.953  16.791  18.493  20.599  40.256  43.773  46.979  50.892  53.672

40  22.164  24.433  26.509  29.051  51.805  55.758  59.342  63.691  66.766
50  29.707  32.357  34.764  37.689  63.167  67.505  71.420  76.154  79.490
60  37.485  40.482  43.188  46.459  74.397  79.082  83.298  88.379  91.952
70  45.442  48.758  51.739  55.329  85.527  90.531  95.023 100.425 104.215
80  53.540  57.153  60.391  64.278  96.578 101.879 106.629 112.329 116.321


Appendix 5 


The area in the right tail under the F distribution curve is equal to 0.01. 

The area in the right tail under the F distribution curve is equal to 0.05.

Addition rule for the union of two events, 72
Adjacent values, 57 
Alternative hypothesis, 185 


Bar graph, 15 

Bayes rule, 72 

Bell-shaped histogram, 20, 41 
Between samples variation, 275 
Between treatments mean square, 276 
Bimodal, trimodal, 41 

Binomial experiment, 93 

Binomial probability formula, 93 
Binomial random variable, 93 
for error, one-way ANOVA, 277
for numerator, F distribution, 272 
for total, one-way ANOVA, 277 
for treatments, one-way ANOVA, 277 
Histogram, 20
Independent events, 69
Interaction plot, 283
Interquartile range, 49
Interval level of measurement, 5
Least squares line, 312
Levels of measurement, 4
Lower boundary, 17
Mean of a discrete random variable, 91

Mean of the sample mean, 144 

Measures of central tendency, 40 

Measures of dispersion, 42 
Median class, 45

Median for grouped data, 45 

Median, 41 

Modal class, 45 

Mode for grouped data, 45 

Mode, 41 

Modified boxplot, 50 


Multinomial probability distribution, 250 
Nominal level of measurement, 4

Nonparametric methods, 334 

Nonrejection region, 186 

Normal approximation to the binomial, 125-126 
Normal curve pdf, 117 

Normal probability distribution, 115 

Null hypothesis, 185 
Observed frequencies, 251

Ogive, 21 

One-tailed test, 185 

One-way ANOVA table, 279 
Operating characteristic curve, 194 
Ordinal level of measurement, 5 
Outcomes, 63 


Paired or matched samples, 224 

Parameter, 148 

Pearson correlation coefficient, 320 

Percentage, 15 

Percentile for an observation, 48 

Percentiles, deciles, and quartiles, 48 

Permutations, 74 

Pie chart, 16 

Point estimate, 166 

Poisson probability formula, 97 

Population, 1

Population correlation coefficient, 320 

Population proportion, 148 

Population Spearman rank correlation coefficient, 344

Population standard deviation, 42 

Possible outliers, 58 

Prediction interval when predicting a single observation, 319

Prediction line, 313 

Probability, 65 

Probability density function (pdf), 113 

Probability distribution, 90 

Probability, classical definition, 65 

Probability, relative frequency definition, 67 

Probability, subjective definition, 67 

Probable outliers, 58 

Pth percentile, 48 

P value, 196-198 


Qualitative data, 14 
Qualitative variable, 4 
Quantitative data, 14 
Quantitative variable, 3 
Random number tables, 140 
Random variable, 89
Range for grouped data, 45
Range, 42
Rank, 334
Ratio level of measurement, 6
Raw data, 14
Regression sum of squares, 315
Rejection regions, 186
Relative frequency, 15
Research hypothesis, 185
Residuals, 314
Response variable, 282
Right-tailed test, 185
Runs test for randomness, 344

Sample proportion, 148
Sample size determination for estimating the mean, 174
Sample size determination for estimating the population proportion, 175
Sample space, 63
Sample standard deviation, 43
Sample, 1
Sampling distribution of the sample mean, 142
Sampling distribution of the sample proportion, 148
Sampling distribution of the sample variance, 257
Sampling error, 144
Scales of measurement, 4
Scatter plot, 309
Shortcut formulas for computing variances, 43
Sign test, 334-336
Signed rank, 337
Simple event, 64
Simple random sampling, 140
Single-valued class, 19
Skewed to the left histogram, 20
Skewed to the right histogram, 20
Spearman rank correlation coefficient, 342
Standard deviation, 42
Standard deviation of a discrete random variable, 92
Standard deviation of errors, 315
Standard error of the mean, 144
Standard error of the proportion, 149
Standard normal distribution table, 117
Standardizing a normal distribution, 120
Statistic, 148
Statistical software packages, 7
Statistics, 1
Stem-and-leaf display, 22
Stratified sampling, 142
Sum of squares of the deviations, 43
Sum of the deviations, 43
Sum of the squares and sum squared, 43
Summation notation, 6
Symmetric histogram, 20
Systematic random sampling, 142

T distribution, 169
Table of binomial probabilities, 95
Test statistic, 186
Testing a hypothesis concerning:
    difference in means, large samples, 213
    difference in means, small samples, 216, 223
    difference in population proportions, 232
    mean difference for dependent samples, 228
    main effects and interaction, 299
    population correlation coefficient, 321
    population mean, large sample, 191
    population mean, small sample, 199
    population proportion, large sample, 200
    population variance, 260
Total sum of squares, 277
Treatment sum of squares, 276
Tree diagram, 63
Two-tailed test, 185
Two-way ANOVA table, 286
Type I and Type II errors, 187
Type II errors, calculating, 193

Ungrouped data, 40
Uniform or rectangular histogram, 20
Uniform probability distribution, 113
Union of events, 71
Upper boundary, 17
Upper class limit, 17
Upper inner fence, 57
Upper outer fence, 57

Variable, 2
Variance for grouped data, 46
Variance of a discrete random variable, 92
Variance, 42
Venn diagram, 64

Wilcoxon rank-sum test, 338-340
Wilcoxon signed-rank test, 337-338
Within samples variation, 275

Z score, 47


